Several parallel computer systems are commercially available today. They can be divided into
three main classes based on the technique used to connect the processing and memory elements of the
system: multistage interconnection network (MIN) based systems, bus based systems, and
hypercube systems. Commercial examples of these system types are the BBN ACI Butterfly, the
Encore Corp. Multimax, and the Intel Corp. iPSC, respectively. The task of deciding which kind of
parallel system is best suited to a particular application domain is a complex one; no well
defined guidelines or decision-assisting tools are currently available. This report describes a series of
parallel system evaluation efforts being conducted with the BM/C3 application domain in mind.
Emphasis is placed on the BBN ACI Butterfly Parallel Processor.

One aspect of the research is the development of a software tool that can be used to conduct
application dependent performance evaluation studies on the Butterfly. Called the Butterfly
Performance Predictor, this tool consists of a system simulator and a code simulator. Using user
provided descriptions of the algorithms of interest in terms of a small set of parameters,
the tool generates estimates of various performance metrics based on basic instruction execution
speeds obtained from processor data books. The structure of the tool, which is under
development, is described.

A  second  aspect  of  the  research  effort  described  in  this  report  is  the  mapping  of  a  specific  battle 
management algorithm onto the Butterfly parallel processor. The goal of this mapping procedure is to
minimize  the  amount  of  contention  for  shared  memory  and  communication  links  by  the  individual 
processing  components.  A  tree-shaped  process  structure  is  suggested  and  evaluated  using  a  simplified 
analytical  technique.  The  mapping  procedure  is  applicable  to  other  algorithms  with  similar  data  flow 
properties. 


No performance evaluation study would be truly complete without a study of the dependability of
the underlying system. None of the existing reliability evaluation tools is capable of computing the
reliability of the Butterfly or Hypercube systems. An analytical model for computing Butterfly
reliability is presented. The model is based on the decomposition technique. A recursive equation is
derived to compute the reliability of a $4^{l}$ system from four $4^{l-1}$ subsystems. Analytical results are
given for 16-node, 64-node, and 256-node Butterfly configurations.

A new analytical technique to compute the reliability of n-dimensional hypercube systems is also
described. A recursive equation is derived to compute the n-cube reliability from a 2-cube or 3-cube
base model. Analytical results are presented for up to 8-dimensional hypercubes.
CHAPTER 1
INTRODUCTION 


The availability of a variety of commercial multiprocessor computers today makes
it difficult to decide on the optimal machine for any specific parallel application area.
In addition to the strengths and weaknesses of each candidate machine, the characteristics
of programs in the target application domain must be taken into account in
making this selection. Unfortunately, neither formal techniques nor software tools are
currently available to assist in this decision process. The development effort described
in this report addresses the problem of deciding which class of parallel computer systems
is best suited to the BM/C3 problem domain. In particular, this report describes
the development of software tools for application dependent performance and dependability
estimation of available parallel computers.


Most commercially available parallel computer systems can be classified as bus
based systems, multi-stage interconnection network (MIN) based systems, or
hypercube systems. Bus based multiprocessors consist of processors, memory modules
and other devices, connected to each other through a simple computer bus. Examples
of this class of system include the Encore Multimax, the Sequent Balance and
Symmetry series, and the Synapse N+1 system, with new products announced regularly.
These systems are typically restricted in size to a maximum of a few tens of
processors due to the performance limitations of current computer buses. In MIN
based multiprocessors, the processors, memory modules and other devices are connected
through a network of stages of switching elements. The BBN ACI Butterfly
system is one commercially available example of a MIN based multiprocessor. The
power of the Butterfly MIN makes multiprocessor systems with hundreds of processors
cost effective. Hypercube multiprocessors are a relatively new entrant in the parallel
processor arena. A hypercube system consists of 2**n processor-memory modules,
with each module directly connected to n neighbors, forming an n-dimensional cube.
Hypercube systems with up to 1024 processor modules are commercially available from
Intel Corp., NCUBE Corp., and Ametek. In this report the Butterfly, Hypercube, and
Multimax machines are considered as candidate architectures for BM/C3 applications.
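
The hypercube connectivity described above can be illustrated with a short sketch. It assumes the usual binary labeling of the 2**n modules, in which two modules are neighbors when their labels differ in exactly one bit; the function names are hypothetical and do not come from any vendor software.

```python
def hypercube_neighbors(node: int, n: int) -> list[int]:
    """Return the labels of the modules adjacent to `node` in an n-cube."""
    if not 0 <= node < 2 ** n:
        raise ValueError("node label out of range for an n-cube")
    # Flipping bit k of the label crosses the cube along dimension k,
    # so every module has exactly one neighbor per dimension.
    return [node ^ (1 << k) for k in range(n)]


def hypercube_links(n: int) -> int:
    """Total link count: 2**n modules with n links each, each link shared by two."""
    return n * (1 << n) // 2
```

For a 10-dimensional cube (1024 modules) this gives 10 neighbors per module and 5120 links in total.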

A major goal in the design of parallel architectures is to provide high computing
power with assured dependability. High computing power can be provided by
exploiting the parallelism in the application algorithms and by mapping these parallel
algorithms onto the candidate architectures. Performance evaluation of multiprocessors
using the above three types of interconnection topologies has been addressed by
various researchers using analytic and simulation models [Bhuyan 84, Das 85, Dias
81, Kruskal 83, Lang 82, Lin 88, Marsan 82, Mudge 84, Read 87, Wittie 81, Wu 84].
However, most of these studies are restricted to evaluation of the architecture. Consideration
of both architecture and algorithms in performance evaluation has received
little attention to date. We discuss this aspect of performance evaluation in Section 2
of this report.

The second requirement, “assured dependability” of parallel architectures, stems
from the critical applications in which these machines are used. The performance analysis
of the parallel systems outlined above implicitly assumes that the components of a
system are fault free. These results give the so called “ideal” performance of a system.
However, in a real situation the components of a system fail at random depending on
the failure rates of the components. At the system level, a multiprocessor consists of
two subsystems. One subsystem is the computation facility which is provided by processors
(nodes) and memories. The second subsystem is the communication network,
used to support interprocessor communication. The failure of a processor (node) or
a memory unit reduces the hardware resources available on the system. The failure
of the interconnection switches or links degrades the communication capability of the
network. All these faults affect the dependability and performance of the system to
varying degrees. A common approach to improve the fault-tolerance of these parallel
systems is to provide graceful degradation as an inherent attribute of a system.


Following Laprie [Laprie 82], dependability is defined as “the quality of service
delivered by the system such that reliance can be justifiably placed on the service.”
Dependability is a generic concept that encompasses reliability, availability, maintainability,
and safety as distinct facets of system specification. It has been reported
[Avizienis 78] that there is a clear need for quantitative measurement of dependability
parameters.

At  the  system  level  specification,  fault-tolerant  systems  are  categorized  as  either 
highly  reliable  or  highly  available  [Siewiorek  82].  Most  of  the  work  in  fault-tolerant 
evaluation  of  parallel  computers  is  confined  to  reliability  modeling.  This  is  mainly 
because  availability  evaluation  is  more  complex  than  reliability  evaluation. 

Reliability evaluation of parallel systems has been studied under two different approaches,
namely terminal reliability and task based reliability [Ingle 77, Raghavendra
84]. Terminal reliability is defined as the probability that at least one communication
path exists between a pair of nodes. This may be an oversimplified estimate for parallel
systems where a job (task) is executed concurrently over several nodes. The task
based reliability, on the other hand, assumes that a system remains operational as long
as a task can be executed with the available resources on the system. This is a more
appropriate measure of reliability in a parallel processing domain.
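
The two reliability notions above can be contrasted on a toy example. The sketch below is illustrative only: it assumes perfect nodes and independent link failures for the terminal measure, and it reduces the task based measure to "at least k of N nodes are working", ignoring the network entirely. The function names and both simplifications are assumptions, not the models developed in this report.

```python
from itertools import product
from math import comb


def connected(src, dst, up_links):
    """Depth-first search over the links that are currently working."""
    seen, frontier = {src}, [src]
    while frontier:
        u = frontier.pop()
        for a, b in up_links:
            for x, y in ((a, b), (b, a)):
                if x == u and y not in seen:
                    seen.add(y)
                    frontier.append(y)
    return dst in seen


def terminal_reliability(links, src, dst, p_link):
    """Probability that at least one working path joins src and dst.

    Enumerates all 2**len(links) link states, so it is only practical
    for very small networks.
    """
    total = 0.0
    for state in product((False, True), repeat=len(links)):
        prob = 1.0
        for ok in state:
            prob *= p_link if ok else 1.0 - p_link
        up = [link for link, ok in zip(links, state) if ok]
        if connected(src, dst, up):
            total += prob
    return total


def task_reliability(n_nodes, k_required, p_node):
    """Probability that at least k_required of n_nodes nodes are working."""
    return sum(comb(n_nodes, j) * p_node ** j * (1 - p_node) ** (n_nodes - j)
               for j in range(k_required, n_nodes + 1))
```

On a 4-node ring with link reliability 0.9, opposite corners are joined by two disjoint two-link paths, so the terminal reliability is 1 - (1 - 0.81)^2 = 0.9639, whereas a task that needs all four nodes (node reliability 0.9) succeeds with probability 0.9^4 = 0.6561.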

Task based dependability evaluation of some parallel computers has been addressed
by different researchers [Arlat 83, Das 85, Das 87, Hwang 82, Ingle 77]. These
studies are incomplete in several respects. For example, none of the models
combine architecture, algorithm requirements, and software issues to model the
system behavior completely. Progress has been hampered mostly by the complexity
of the parallel machine architecture. In particular, the exact reliability modeling
of the communication networks, such as the MIN, is quite complex and can lead to
NP-hard problems [Provan 86]. Hence, very little research effort has been directed to
model the dependability of parallel computers combining both the computation and
communication degradation.

While classical dependability measures such as reliability and availability are suitable
to evaluate uniprocessor systems, these measures may not be good indicators of
parallel system behavior. Dependability measures specify only the operational status
of a system at any time t. No performance statistics can be gathered from the
reliability or availability study. High performance being the main objective of parallel
architectures, performance-related dependability evaluation is essential to evaluate
these architectures. This evaluation will specify, for example, the execution time of a
job in a real environment when all kinds of component failure and repairs are possible.

Performance related dependability measures are relatively new compared to classical
dependability theory. A number of performance-related dependability measures
such as computation reliability, computation availability [Beaudry 78], performability
[Meyer 80], capacity and workload characterization [Gay 79], have been proposed for
degradable multiprocessors. However, none of these models have been applied in a
real sense to the candidate architectures in consideration.

There are several automatic program packages such as ARIES [Makam 82], CARE
III [Stiffler 82], HARP [Geist 83], SAVE [Goyal 87], and SHARPE [Sahner 87] available
for computing the dependability of complex systems. Markov models of a system are
used to compute the reliability/availability of the system using numerical techniques.
However, these packages are not general enough to handle the parallel architectures
under investigation. The difficulty lies in generating the Markov states of a system such
as the Butterfly or Hypercube. To our knowledge there is no tool available today that
can generate the Markov chain of the above systems automatically. The capabilities
and weaknesses of some of the packages are reported in Section 4.
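
As a minimal illustration of Markov-based reliability computation, the sketch below solves a hand-written three-state chain numerically. It assumes a two-unit system with no repair, a constant per-unit failure rate `lam`, and operation as long as one unit survives; it does not reproduce the method of any of the packages named above, and the function name is hypothetical.

```python
import math


def markov_reliability(lam, t, steps=1000):
    """R(t) for a two-unit system that works while at least one unit is up.

    States: both units up (p2), one unit up (p1), failed (absorbing).
    The state equations are integrated with a classical fourth-order
    Runge-Kutta scheme.
    """
    def deriv(p2, p1):
        return (-2.0 * lam * p2, 2.0 * lam * p2 - lam * p1)

    p2, p1 = 1.0, 0.0
    h = t / steps
    for _ in range(steps):
        k1 = deriv(p2, p1)
        k2 = deriv(p2 + h / 2 * k1[0], p1 + h / 2 * k1[1])
        k3 = deriv(p2 + h / 2 * k2[0], p1 + h / 2 * k2[1])
        k4 = deriv(p2 + h * k3[0], p1 + h * k3[1])
        p2 += h / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0])
        p1 += h / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1])
    return p2 + p1


# Closed form for this small chain, useful as a cross-check.
def closed_form(lam, t):
    return 2.0 * math.exp(-lam * t) - math.exp(-2.0 * lam * t)
```

The chain here has three states and can be written by hand. For a Butterfly or Hypercube the state space grows with every combination of failed nodes, switches, and links, which is why generating the states automatically is the hard part.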


CHAPTER  2 

PERFORMANCE  EVALUATION  OF  PARALLEL  ARCHITECTURES 


It was decided to commence this study by concentrating on the MIN based Butterfly
parallel processor. This section describes the development of tools to assist in
the evaluation of the performance of such a MIN based computer system; these tools
are referred to as the Butterfly Performance Predictor.

2.1.  Butterfly  Performance  Predictor  Overview 

The general operation of the Performance Predictor is illustrated in Figure 2.1. It estimates
performance metrics based on two kinds of data: architectural parameters (detailed
information about the parallel machine architecture), and algorithm parameters
(information about algorithms from the application domain under study). Theoretically,
such a performance predictor could be used to study different parallel processor
architectures by merely varying the architectural parameters. In practice, it is difficult
to conceive of a set of parameters powerful enough to categorize bus-based systems,
MIN based systems, and hypercube systems in sufficient detail to allow reasonable
accuracy of performance prediction. A more conservative design goal was employed
in this effort; the architectural parameters were chosen to enable the user to study
parallel machines “similar” to the Butterfly.

Figure 2.2 shows the Performance Predictor in more detail; its main component is
a Butterfly Simulator, a program that simulates program execution on a Butterfly
while accumulating performance measures. To drive the simulator under conditions
representative of the target application domain, two strategies are considered. In the
first strategy, real Butterfly programs are used. This scheme has obvious drawbacks: it
requires the simulator to be sophisticated enough to process actual Butterfly machine
code and also requires access to BM/C3 programs coded specifically for the Butterfly.
A more flexible and user-friendly strategy is to drive the simulator with synthetically
generated instruction streams that are representative of the target application domain.
The second key component of the Performance Predictor, therefore, is a program
(called the Code Simulator) that generates these instruction streams.

[Figure labels: Algorithm Characteristics; Actual Butterfly Programs]

2.2.  Butterfly  Parallel  Processor 

The Butterfly multiprocessor system is made up of processor nodes and a Butterfly
interconnection network as shown in Figure 2.3. The network is depicted as a cylinder
since both its inputs and outputs are processor nodes, unlike a conventional “dance-hall”
multiprocessor architecture, which would have processors at one end and memory
modules at the other. All of the distributed memory is globally accessible. Remote
memory accesses are conducted through the network. Each processor node contains
a processor (currently a Motorola 68020), an arithmetic co-processor (MC68881), 1-4
Megabytes of memory, memory management hardware, and an interface to the network,
as illustrated in Figure 2.4.

2.3.  Butterfly  Simulator 


The Butterfly Simulator is to contain two components: a network simulator and a
node simulator. The network simulator maintains the status of the Butterfly network
while producing timing estimates of how long it takes to traverse the network. The
node simulator accepts Butterfly programs as input and estimates their execution time.
It uses the network simulator for timing information related to the Butterfly MIN, and
uses a set of files of architectural information to do its own timing estimation. These
files contain the “architectural parameters” mentioned earlier in this report, and are
referred to as the processor files.

The program execution timing estimates are made at the instruction level. The
execution of the program is traced instruction by instruction, and the time taken for
each instruction is computed based on timing information obtained from Motorola
data books for the MC68020 and the MC68881 and accumulated in the Performance
Predictor's processor files; these files contain, for each instruction-addressing mode
pair, the time that it takes to execute the instruction on the processor. Figure 2.5
shows sample processor files.
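
The instruction-level timing idea can be sketched as a table lookup. The sketch below is a minimal illustration, not the simulator's actual code: the table layout, the cycle counts, and the assumption that times are stored in clock cycles and scaled by a clock period in nanoseconds are all hypothetical.

```python
CLOCK_NS = 16  # assumed clock period in nanoseconds

# (mnemonic, source mode, destination mode) -> cycles.
# The cycle counts below are made up for illustration.
CYCLES = {
    ("move", "ari", "ard"): 4,
    ("add", "arid", "ari"): 6,
    ("nop", None, None): 4,
}


def estimate_time_ns(trace, cycles=CYCLES, clock_ns=CLOCK_NS):
    """Sum the per-instruction cycle counts of a trace and scale by the clock."""
    total_cycles = 0
    for mnemonic, src_mode, dst_mode in trace:
        total_cycles += cycles[(mnemonic, src_mode, dst_mode)]
    return total_cycles * clock_ns


trace = [("move", "ari", "ard"), ("add", "arid", "ari"), ("nop", None, None)]
total_ns = estimate_time_ns(trace)  # (4 + 6 + 4) cycles * 16 ns = 224 ns
```

Keying the table on the instruction together with its addressing modes mirrors the "instruction-addressing mode pair" organization of the processor files.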

This approach has one serious drawback: it cannot take data-dependent (conditional)
branches into account in producing its timing estimates. This would have
been possible if the Butterfly simulator simulated the actual execution of the input
Butterfly programs, which would be slow, costly, and difficult to implement. As an
alternative, statistics from the literature on research into branch prediction [DeRosa
87, Lee 84, McFarling 86, Smith 81] were utilized to take conditional branches into
account. The approach used in the first version of the Butterfly simulator was as follows:
on encountering a given conditional branch for the first time, it would be taken
with a probability of 0.5. When the same conditional branch is encountered again in
the processing of the program, the simulator assumes that the branch goes the same
way as it did the previous time with a probability of 0.9; it goes the opposite way with
a probability of 0.1.
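
The branch heuristic just described is easy to state in code. The following is a minimal sketch under the probabilities given in the text (0.5 on first encounter, 0.9 to repeat the previous outcome); the class name, the use of a per-branch dictionary, and the injectable random generator are assumptions made for illustration.

```python
import random


class BranchModel:
    """Probabilistic stand-in for data-dependent conditional branches."""

    def __init__(self, p_first=0.5, p_repeat=0.9, rng=None):
        self.p_first = p_first      # chance a branch is taken when first seen
        self.p_repeat = p_repeat    # chance it repeats its previous outcome
        self.rng = rng or random.Random()
        self.last = {}              # branch address -> last outcome (True = taken)

    def taken(self, address):
        """Decide whether the branch at `address` is taken this time."""
        if address not in self.last:
            outcome = self.rng.random() < self.p_first
        else:
            same = self.rng.random() < self.p_repeat
            outcome = self.last[address] if same else not self.last[address]
        self.last[address] = outcome
        return outcome
```

Passing a seeded `random.Random` makes a simulation run repeatable, which is convenient when comparing timing estimates across simulator versions.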

The first version of the Butterfly simulator is under development in the programming
language C on a SUN 3/50 workstation running 4.2BSD UNIX. The development
has proceeded as follows: a timing simulator for a single MC68020 was first developed,
and extended into a simulator for multiple 68020s with one task running on each, by
early September 1987. This simulator is now being extended to time multiple tasks
running on multiple MC68020s, closer to the actual Butterfly environment. At the
same time, efforts to refine the timing estimation procedure are underway, as are efforts
to condense the processor information files, which currently occupy several Megabytes
of disk file space.

The only performance metric that the Butterfly simulator currently measures is
total execution time. Accumulation of other metrics, such as MIPS (millions of
instructions executed per second), MFLOPS (millions of floating point instructions
executed per second), processor idle time, and network related metrics, is also being
incorporated.


2.4.  Butterfly  Network  Simulator 


Processor File Example:

16           % clock rate in nanoseconds
**           % section delimiter
ari          % section with mnemonics for operand modes
ard
arid
**
move         % section with two operand instructions
add
sub
**
neg          % section with one operand instructions
load
**
nop 4        % section with no operand instruction and times
**
bcc 10 15    % section with conditional branch instructions and times
**
bra 10       % section with unconditional branch instructions and times
**
jsr 10       % section with subroutine call instructions and times
**
rtr 5        % section with subroutine return instructions and times
**
frk 20       % section with fork instruction(s) and time(s)
**
snd 8        % section with send instruction(s) and time(s)
**
rcv 3        % section with receive instruction(s) and times
EOF

Figure 2.5 Sample Processor File
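
A reader for the processor file layout shown in Figure 2.5 might look like the sketch below. It assumes only what the sample shows: a leading clock value, sections separated by `**`, `%` starting a comment, optional integer times after a mnemonic, and `EOF` as the terminator. The function name and the returned structure are hypothetical.

```python
def parse_processor_file(text):
    """Split a processor file into its clock value and its ordered sections.

    Each section is a list of (mnemonic, [times...]) pairs in file order.
    """
    sections, current, clock = [], None, None
    for raw in text.splitlines():
        line = raw.split("%", 1)[0].strip()   # drop trailing comments
        if not line:
            continue
        if line == "EOF":
            break
        if line == "**":
            if current is not None:
                sections.append(current)
            current = []                      # start the next section
        elif current is None:
            clock = int(line)                 # value before the first delimiter
        else:
            fields = line.split()
            current.append((fields[0], [int(x) for x in fields[1:]]))
    if current is not None:
        sections.append(current)
    return clock, sections
```

Because sections are identified only by their position, the caller has to know the fixed section order (operand modes, two operand instructions, and so on) to interpret the result.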
Instruction Time File Example:

move         % instruction mnemonic
3,4,5;6,7,8; . . . ; 7,8,9;
3,4,5;6,7,8; . . . ; 7,8,9;
add
3,4,5;2,4,8; . . . ; 7,8,9;
. . .
23,24,25;36,37,38; . . . ; 57,58,59;
**           % section for one operand instructions
neg
3,4,5;
5,6,8;
. . .


Input  Program  File  Example: 


loop:  move  ari,  ard 
add  arid,  ari 
bcc  loop 
bsr  inc 
jmp  end 

inc:  add  ard,  arid 


end:  nop 
EOP 


Figure 2.5 (Cont'd) Sample Data Files
The objective of this part of the simulator project is to implement a software module
in C to simulate the dynamics of the Butterfly network. The module is intended to
interface with the node simulator. The two modules interact through function calls.
At simulation time, the node simulator keeps issuing requests for network services
by calling the network simulator. These requests correspond to non-local memory
references generated in chronological order by processor nodes of the Butterfly parallel
processor. For each request, the network simulator computes the response time
of the request through the network by considering latency due to propagation
delay and network contention.

The network simulator comprises two major components: a network switching
mechanism and a conflict resolution mechanism. The former is used for directing
a request for accessing a specific memory through the switches and communication
links according to the routing rules of the Butterfly network. The latter is responsible
for detecting network contentions where many requests compete for the same switch
outputs or communication links, and for arranging them in order through the outputs
or communication links in contention. It also adds a time penalty to the response
time of a deferred request. The two mechanisms are associated with two essential data
structures, a switch matrix and a collision matrix, for keeping track of network
status and for recording network contentions, respectively. Both take the form of a
3-dimensional array, reflecting the topology of the Butterfly network.
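
A much simplified version of the conflict resolution idea is sketched below. It assumes that each request carries a precomputed path of (stage, switch, port) outputs, that every output is held for a fixed `hop_delay`, and that a deferred request simply waits until the contended output is free. The dictionaries stand in for the switch and collision matrices; all names and the waiting policy are assumptions, not the design of the network simulator itself.

```python
def simulate_requests(requests, hop_delay=1):
    """Return per-request response times and a collision count per output.

    `requests` is a chronologically ordered list of (issue_time, path),
    where `path` lists the (stage, switch, port) outputs the request uses.
    """
    busy_until = {}   # switch-matrix stand-in: output -> time it becomes free
    collisions = {}   # collision-matrix stand-in: output -> deferred requests
    response_times = []
    for issue_time, path in requests:
        t = issue_time
        for output in path:
            free_at = busy_until.get(output, 0)
            if free_at > t:
                # Contention: defer the request until the output is free.
                collisions[output] = collisions.get(output, 0) + 1
                t = free_at
            t += hop_delay
            busy_until[output] = t
        response_times.append(t - issue_time)
    return response_times, collisions
```

With `hop_delay=2`, two requests issued at time 0 that share their first-stage output see response times of 4 and 6: the second pays a penalty equal to the time the first holds the shared output.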

The implementation of the network simulator is expected to be completed by the end
of December 1987.

2.5.  Code  Simulator 

The Performance Predictor, as shown in Figure 2.2, is being designed to accept two
forms of input: actual Butterfly program files, and synthetically generated program
files representative of algorithms in the application domain of interest. The generation
of these synthetic traces is the duty of the Code Simulator.

The Code Simulator is still under development. Its operation is based on a set
of input parameters that characterizes the algorithms of interest. Examples of such
parameters are granularity, parallelism, and computation/communication ratio. Granularity
describes the size of the individual parallel tasks comprising the input program;
it will be used by the Code Simulator to determine how large the synthetic trace files
are to be. Degree of parallelism describes the number of parallel tasks; the Code
Simulator will use this parameter to determine how many synthetic program files to
generate, as well as in task activation. The computation/communication ratio would
be used to determine how many computation instructions should be incorporated per
communication instruction in the synthetic program files. Other parameters will
clearly be needed to adequately characterize parallel algorithms; these three examples
represent a starting point.
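
How the three example parameters could drive trace generation is sketched below. This is an illustration only: the mnemonics are hypothetical placeholders, instructions are drawn uniformly rather than from frequency tables, and the rule "one communication instruction after every `comp_comm_ratio` computation instructions" is just one possible reading of the ratio.

```python
import random


def generate_synthetic_programs(granularity, parallelism, comp_comm_ratio, seed=0):
    """Return `parallelism` instruction lists, each `granularity` long."""
    rng = random.Random(seed)
    compute_ops = ["move", "add", "sub", "neg"]   # placeholder mnemonics
    comm_ops = ["snd", "rcv"]                     # placeholder mnemonics
    programs = []
    for _ in range(parallelism):
        program = []
        for i in range(granularity):
            # Every (ratio + 1)-th slot holds a communication instruction.
            if (i + 1) % (comp_comm_ratio + 1) == 0:
                program.append(rng.choice(comm_ops))
            else:
                program.append(rng.choice(compute_ops))
        programs.append(program)
    return programs
```

For example, `generate_synthetic_programs(12, 3, 3)` yields three traces of twelve instructions, each containing nine computation and three communication instructions.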

The generation of synthetic program files is to be driven by tables of static and
dynamic statistics of typical high level language program contents. These tables
have been compiled based on the vast literature on instruction execution frequencies,
operand addressing mode frequencies, instruction transition frequencies, etc. [Alexander
75, Brookes 82, DePrycker 82, Ditzel 80, Elshoff 76a, Elshoff 76b, Foster 71, Knuth
71, Tanenbaum 78, Wiecek 82]. Since no statistics are available relating directly to the
MC68020 instruction set, these tables were derived based on equivalent features reported
for other systems in the papers mentioned above.


CHAPTER  3 

MAPPING  THE  BATTLE  MANAGEMENT  ALGORITHM 
TO  THE  BUTTERFLY  PARALLEL  PROCESSOR 
This chapter is concerned with mapping the Battle Management Algorithm onto
a BBN Butterfly™ shared-memory multiprocessor. An efficient mapping method
for the algorithm is presented. The proposed method is in fact a general approach to
tailoring and fitting a class of numerical and non-numerical algorithms that rely heavily
on global search and broadcast to the Butterfly™ Parallel Processor.

It is known that the overall performance of a parallel algorithm on a multiprocessor
system depends largely on how well the communication structure of the parallel
algorithm is matched with the system interconnection structure. In a shared-memory
environment, there are two major factors that have adverse effects on achieving the
match of the two structures: (1) contentions in shared memories,
and (2) conflicts in communication links. Performance analysis on a Butterfly™
multiprocessor has been presented in [Crowther 85], [Tomas 86]. They point out that if
contention problems in shared memories and communication links become dominant, the
speedup curve saturates as more processors are added. LeBlanc also examines
the effect of memory and switch contention by adding extra memories and extra
switches in the system network [LeBlanc 86]. He concludes that an implementation
based on very efficient communication (e.g., shared memory) may perform worse than
one based on a less efficient mechanism if such efficiency causes too much communication
overhead due to memory and switch contention. Several previous efforts have
addressed the memory contention problem. Notable are the IBM RP3 and NYU
Ultracomputer projects [Pfister 85], [Lee 86], in which hardware
message-combining techniques are used. Since hardware combining networks are
expensive, Yew proposes an effective software combining tree for decreasing memory
contention and preventing tree saturation in the interconnection network [Yew 87].

We propose an algorithm-based method for minimizing the above contention problems.
Unlike previous works, the proposed method does not require
hardware augmentation or mediation in communication networks. Contention costs
both in shared memories and in communication links are minimized by an efficient
mapping method with two different phases of tree-shape communication structures,
one for searching and the other for broadcasting. The tree structures allow us to
rapidly determine and broadcast critical data without concern for memory and link
contentions.

3.1.  Characteristics  of  The  Butterfly  Network 

The core of the Butterfly™ Parallel Processor is a multistage switching network,
called the Butterfly Network, through which processor nodes access remote memories
in a packet switching manner. Major characteristics of a Butterfly Network with $2^m$
inputs and $2^m$ outputs, where m is a positive even number, are enumerated below
(SE denotes a switching element):

(1) the number of stages = $\log_4 2^m = m/2$,

(2) the number of SEs in one stage = $2^m / 4 = 2^{m-2}$, and

(3) the total number of SEs = $\frac{m}{2} \cdot 2^{m-2} = m \cdot 2^{m-3}$.

If we let $N = 2^m$ represent the number of processors, the switch has the advantages
that the total number of SEs needed is $O(N \log_4 N)$ and the bandwidth of the network
is $O(N)$. Figure 3.1 shows a special case of the BBN Butterfly™ parallel processor
with m = 4.
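
The size relations above can be checked numerically with a few lines of code. The sketch assumes 4-input, 4-output switching elements, which is what the base-4 logarithm in the stage count implies; the function name is hypothetical.

```python
def butterfly_dimensions(m):
    """Return (stages, SEs per stage, total SEs) for a 2**m-port network."""
    if m <= 0 or m % 2:
        raise ValueError("m must be a positive even number")
    ports = 2 ** m
    stages = m // 2               # log4(2**m)
    ses_per_stage = ports // 4    # each SE serves four inputs and four outputs
    return stages, ses_per_stage, stages * ses_per_stage
```

For m = 4 this gives 2 stages of 4 SEs (8 in total), and for m = 8 it gives 4 stages of 64 SEs (256 in total), in agreement with m * 2^(m-3).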

An m-bit binary representation of the source nodes (processors) and destination
nodes (memories) of the Butterfly network can be expressed as follows:

$$b_{m-1} b_{m-2} \cdots b_1 b_0$$

where m = 4, 6, 8, .... The establishment of a connection from a source (S) to its
destination (D) is based on a self-routing scheme. That is, to establish a connection
from S to D, the binary address of D is used as a routing tag to direct the connection. If
we let $S = s_{m-1} s_{m-2} \cdots s_1 s_0$ and $D = d_{m-1} d_{m-2} \cdots d_1 d_0$, every two binary bits
$d_{2i+1} d_{2i}$ correspond to the setting of the switching element at stage i, where
i = 0, 1, 2, ..., (m/2 - 1) is the stage number. The communication link traversed by
the connection from source to destination at stage i is described as

[expression illegible in the source]

where $0 \le i \le m/2 - 1$.
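
The self-routing rule can be sketched as a digit extraction. The sketch assumes that stage i consumes the bit pair $d_{2i+1} d_{2i}$ of the destination address, which is one reading of the scheme described above; the function name is hypothetical, and the per-stage link expression is not modeled.

```python
def routing_tags(dest, m):
    """Two-bit switch settings used at stages 0 .. m/2 - 1 to reach `dest`."""
    if m <= 0 or m % 2:
        raise ValueError("m must be a positive even number")
    if not 0 <= dest < 2 ** m:
        raise ValueError("destination address does not fit in m bits")
    # Stage i reads bits 2i+1 and 2i of the destination address.
    return [(dest >> (2 * i)) & 0b11 for i in range(m // 2)]
```

For m = 4 and destination 1101 (binary), the tags are 01 at stage 0 and 11 at stage 1. The source address plays no part in choosing the switch settings, which is what makes the scheme self-routing.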


3.2.  Problem  Formulation  and  Algorithm  Parallelism 

3.2.1  Nature  of  The  Battle  Management  Algorithm 

The Battle Management Algorithm belongs to a class of linear programming problems.
These problems are concerned with the optimization of a set of linear functions
subject to some linear constraints and to the condition that all the variables must assume
nonnegative values. The function to be optimized is called the objective function.
For example, a general formulation of linear programming problems is as follows: To
optimize the objective function

$$C = V_{0,0}X_0 + V_{0,1}X_1 + \cdots + V_{0,d}X_d + \cdots + V_{0,r}X_r$$

subject to the linear constraints

$$
\begin{aligned}
V_{1,0}X_0 + V_{1,1}X_1 + \cdots + V_{1,d}X_d + \cdots + V_{1,r}X_r \;&\{\le, =, \ge\}\; W_1; \\
V_{2,0}X_0 + V_{2,1}X_1 + \cdots + V_{2,d}X_d + \cdots + V_{2,r}X_r \;&\{\le, =, \ge\}\; W_2; \\
&\;\;\vdots \\
V_{k,0}X_0 + V_{k,1}X_1 + \cdots + V_{k,d}X_d + \cdots + V_{k,r}X_r \;&\{\le, =, \ge\}\; W_k; \\
&\;\;\vdots \\
V_{n-1,0}X_0 + V_{n-1,1}X_1 + \cdots + V_{n-1,d}X_d + \cdots + V_{n-1,r}X_r \;&\{\le, =, \ge\}\; W_{n-1};
\end{aligned}
$$

and to the nonnegative condition $X_0 \ge 0, X_1 \ge 0, \ldots, X_r \ge 0$.

A set of values of the variables $X_0, X_1, \ldots, X_r$ that satisfy the linear constraints and the nonnegative condition is called a feasible solution. A feasible solution that optimizes the objective function is called an optimal feasible solution. The region that contains all the feasible solutions is called the feasible region, which is always a convex polygon.

For the problems of optimizing linear functions subject to linear constraints, it is known that an optimal feasible solution, when one exists, can always be found at a vertex of the feasible region. Hence, we need only examine the value of the objective function at each vertex of the feasible region. Finding an optimal feasible solution this way may become very tedious as the number of variables and the number of linear constraints increase. The simplex method is one of the most useful methods for finding an optimal feasible solution without exhaustively examining the values of the objective function at all vertices. It is an iterative search procedure. Starting from an artificial candidate solution, it first finds a basic feasible solution, which is represented by a vertex of the convex polygon. From there, it searches for another vertex at which the value of the objective function is improved. This search is repeated iteratively until the optimal solution is found. Since there are only a finite number of vertices, and the objective function value is improved every time a new vertex is reached, the search process will eventually converge to the optimal solution.

The theory of how to find another vertex at which the value of the objective function is better, and of how to recognize that an optimal solution has been reached, is described in detail in [Liu 68]. Here, we focus only on the procedures of the simplex method. The objective function and linear constraints can be rewritten as follows.

$$\begin{aligned}
C - V_{0,0}X_0 - V_{0,1}X_1 - \cdots - V_{0,d}X_d - \cdots - V_{0,r}X_r &= W_0,\\
X_{r+1} + V_{1,0}X_0 + V_{1,1}X_1 + \cdots + V_{1,d}X_d + \cdots + V_{1,r}X_r &= W_1;\\
X_{r+2} + V_{2,0}X_0 + V_{2,1}X_1 + \cdots + V_{2,d}X_d + \cdots + V_{2,r}X_r &= W_2;\\
&\;\;\vdots\\
X_{r+k} + V_{k,0}X_0 + V_{k,1}X_1 + \cdots + V_{k,d}X_d + \cdots + V_{k,r}X_r &= W_k;\\
&\;\;\vdots\\
X_{r+n-1} + V_{n-1,0}X_0 + V_{n-1,1}X_1 + \cdots + V_{n-1,d}X_d + \cdots + V_{n-1,r}X_r &= W_{n-1};
\end{aligned}$$

where $W_0 = 0$, and $X_{r+1}, X_{r+2}, \ldots, X_{r+n-1}$ are the added slack variables (or basic variables, initially), and $X_0, X_1, \ldots, X_r$ are the non-basic variables. Excluding the coefficients of the slack variables, an $n \times (r+2)$ coefficient matrix is generated.

For the coefficients of the objective function equation, any negative value will produce one pivot column. For example, if $V_{0,d} < 0$, then the $d$-th column is called the pivot column. To determine which of the variables $X_{r+1}, X_{r+2}, \ldots, X_{r+n-1}$ will become a non-basic variable, the ratios $\frac{W_1}{V_{1,d}}, \frac{W_2}{V_{2,d}}, \ldots, \frac{W_{n-1}}{V_{n-1,d}}$ are computed. Suppose the ratio $\frac{W_k}{V_{k,d}}$ is the smallest of all the positive quantities; then the coefficient $V_{k,d}$ is called the pivot. The row that contains the pivot is called the pivot row. As soon as the pivot is determined, the operations are continued as follows.

(1)  The  pivot  is  replaced  by  its  reciprocal. 

(2)  The  other  entries  in  the  pivot  row  are  divided  by  the  pivot. 

(3)  The  other  entries  in  the  pivot  column  are  divided  by  the  pivot  with  their  signs 
reversed. 

(4) For the other entries, $V_{i,j}$ ($i \ne k$, $j \ne d$) is replaced by $V_{i,j} - V_{i,d}\frac{V_{k,j}}{V_{k,d}}$, and $W_i$ ($i \ne k$) is replaced by $W_i - W_k\frac{V_{i,d}}{V_{k,d}}$.
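The four update rules can be sketched in a few lines of Python. This is a minimal illustration, not the report's implementation: the function name `pivot_step`, the tableau layout (each row stored as $[V_{i,0}, \ldots, V_{i,r}, W_i]$, with row 0 holding the objective equation), the choice of the first negative objective coefficient as pivot column, and the restriction of the ratio test to rows with $V_{i,d} > 0$ are all assumptions made for the example. The small problem at the end is hypothetical.

```python
from fractions import Fraction

def pivot_step(T):
    """Apply one simplex iteration to tableau T in place.

    Each row is [V_i0, ..., V_ir, W_i]; row 0 is the objective equation.
    Returns False once no objective coefficient is negative.
    """
    cols = len(T[0])
    # Pivot column d: first negative coefficient of the objective row.
    d = next((j for j in range(cols - 1) if T[0][j] < 0), None)
    if d is None:
        return False
    # Ratio test (assumption: only rows with V_id > 0 are candidates).
    candidates = [(T[i][-1] / T[i][d], i) for i in range(1, len(T)) if T[i][d] > 0]
    if not candidates:
        raise ValueError("no admissible pivot row: the objective is unbounded")
    _, k = min(candidates)
    pivot = T[k][d]
    for i in range(len(T)):
        if i == k:
            continue
        factor = T[i][d]
        for j in range(cols):
            if j != d:
                T[i][j] -= factor * T[k][j] / pivot   # rule (4)
        T[i][d] = -factor / pivot                     # rule (3)
    for j in range(cols):
        if j != d:
            T[k][j] /= pivot                          # rule (2)
    T[k][d] = 1 / pivot                               # rule (1)
    return True

# Hypothetical example: maximize 3*X0 + 5*X1 subject to
# X0 <= 4, 2*X1 <= 12, 3*X0 + 2*X1 <= 18, written as C - 3*X0 - 5*X1 = 0.
T = [[Fraction(v) for v in row]
     for row in [[-3, -5, 0], [1, 0, 4], [0, 2, 12], [3, 2, 18]]]
iterations = 0
while pivot_step(T):
    iterations += 1
optimum = T[0][-1]   # W_0 holds the objective value at termination
```

Exact rational arithmetic (`Fraction`) is used only to keep the example free of rounding effects; a floating-point tableau would work the same way.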

The  above  operations  complete  one  iteration.  The  iterations  are  continued  until 
the  coefficients  in  the  expression  for  the  objective  function  are  all  positive.  At  this 
point, the optimal solution is found.

3.2.2 Task Decomposition

This subsection deals with decomposing the sequential algorithm of the simplex method into several concurrently executed subtasks; these subtasks are then assigned to the processors on the Butterfly™ network. In general, there are two correlated factors influencing the decomposition: granularity and interprocessor communication cost. To achieve a maximum degree of parallelism, we attempt to distribute computations to as many processors as possible (fine grain). However, overhead due to interprocessor communication drives the task allocation strategy to cluster modules onto as few processors as possible (large grain). Obviously, it is not easy to satisfy these two conflicting factors simultaneously; therefore, a compromise must be made to find the optimal arrangement for a task such that the maximum system performance can be achieved.

We  illustrate  the  data  flow  and  process  structure  of  the  simplex  method  in  Figure 
3.2.  In  each  iteration,  a  pivot  column  is  arbitrarily  selected  corresponding  to  one 
negative  coefficient  of  the  objective  equation.  The  following  parallel  algorithms  consist 
of  four  sequential  computational  phases  in  one  iteration: 

Phase 1: To compute all the ratios simultaneously by accessing the local data.

Phase 2: To determine the smallest ratio and the pivot.

Phase  3:  To  modify  the  data  elements  on  pivot  column  by  accessing  the  pivot. 

Phase  4:  To  modify  the  data  elements  other  than  those  on  pivot  column. 

To obtain the optimal balance between granularity and interprocessor communication cost, we arrange for the elements of one row of the coefficient matrix to be processed by one processor, as shown in Figure 3.2. Hence, we are dealing with mapping a problem with $n \times (r+2)$ data elements onto a Butterfly™ network with $2^m$ processors and $2^m$ shared memories. The relation between the problem size and the network size is

$$2^{m-2} < n \le 2^m.$$

Before the computation starts, we assume data elements are assigned to processors in the following way: the data elements on row 0, row 1, ..., row $(n-1)$ of the coefficient matrix are assigned to the processors $P_0, P_1, \ldots, P_{n-1}$ on the Butterfly network respectively. That is, the data elements on row 0, row 1, ..., row $(n-1)$ are stored in the memory modules $M_0, M_1, \ldots, M_{n-1}$ respectively.
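As a small illustration of the size relation, the following sketch picks the network exponent $m$ for a given number of rows $n$. The function name `network_exponent` is hypothetical, and the sketch assumes, as stated earlier for this network, that $m$ takes only the even values 4, 6, 8, and so on.

```python
def network_exponent(n):
    """Smallest even m >= 4 such that n rows fit on 2**m processors."""
    m = 4
    while n > 2 ** m:
        m += 2          # the network grows in steps of two address bits
    return m

# Row i of the coefficient matrix is then held by processor P_i / memory M_i.
sizes = {n: network_exponent(n) for n in (16, 17, 64, 65)}
```

For $n \le 4$ the lower bound $2^{m-2} < n$ cannot be met with $m \ge 4$; the sketch simply returns the smallest network in that case.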


3.3.  Algorithm  Mapping 

A significant aspect of parallel algorithms is that in many cases, the model on which they run is not directly realizable in present-day hardware. Typically, for an ideal parallel computer, each processor can access (read from or write into) one memory in one step. In practice, simultaneous reads or writes on a memory by more than one processor may result in contention in that memory and in the communication links. For better system performance, therefore, an efficient mapping of the linear programming algorithms onto the Butterfly™ network should be designed to reduce contention both in the shared memories and in the communication links.

Based on the fact that the same arithmetic operations of the linear programming algorithms always reside in the same level, in the following we design conflict-free connection strategies to prevent collisions on the communication links at the same level.

3.3.1  Conflict-Free  Connections 

Prior to describing the parallel algorithm and mapping strategies, we first introduce definitions and theorems relevant to the algorithm mapping. A routing scheme must be used to set up the connections between processors and shared memories. However, simultaneous connections may result in conflicts, since the Butterfly™ network belongs to the class of blocking interconnection networks.

Definition: A connection conflict is defined as a situation in which two connections use the same communication link at some stage at the same time. Two connections which do not result in a connection conflict are said to be conflict-free.

Definition: Let $X$ and $U$ represent two different $m$-bit binary numbers; then $\varphi(X,U)$ is the maximum number of consecutive identical two-bit groups of $X$ and $U$, counted from the low-order end.

For example, if we consider $m = 6$, so that the number of stages is $m/2 = 3$, and $X = 01\,10\,10$, $U = 10\,10\,10$, then $\varphi(X,U) = 2$.
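The definition of $\varphi$ translates directly into code. The sketch below is illustrative only; the function name `phi` and the representation of node addresses as Python integers are assumptions.

```python
def phi(x, u, m):
    """Number of consecutive identical low-order two-bit groups of x and u."""
    count = 0
    for p in range(m // 2):
        # Compare the p-th two-bit group, starting from the low-order end.
        if (x >> (2 * p)) & 0b11 != (u >> (2 * p)) & 0b11:
            break
        count += 1
    return count

example = phi(0b011010, 0b101010, 6)   # the X and U of the example above
```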

Theorem 3.1: In a BBN Butterfly™ network of size $N = 2^m$ with $\frac{m}{2}$ stages, two connections $X \to Y$ and $U \to V$ ($X \ne U$ and $Y \ne V$) are conflict-free if and only if

$$\varphi(X,U) + \varphi(Y,V) < \frac{m}{2}.$$

Proof: The communication links traversed by $X \to Y$ and $U \to V$ at stage $i$ are described as

$$(x_1x_0x_3x_2\cdots x_{m-1-2i}x_{m-2-2i}\,y_{2i-1}y_{2i-2}\cdots y_3y_2y_1y_0)_i \quad\text{and}\quad (u_1u_0u_3u_2\cdots u_{m-1-2i}u_{m-2-2i}\,v_{2i-1}v_{2i-2}\cdots v_3v_2v_1v_0)_i$$

respectively, where $0 \le i \le \frac{m}{2} - 1$.

For the sufficient condition (conflict-free connections imply the inequality): since the two connections $X \to Y$ and $U \to V$ are conflict-free, at every stage $i$,

$$(x_1x_0x_3x_2\cdots x_{m-1-2i}x_{m-2-2i}\,y_{2i-1}y_{2i-2}\cdots y_3y_2y_1y_0)_i \ne (u_1u_0u_3u_2\cdots u_{m-1-2i}u_{m-2-2i}\,v_{2i-1}v_{2i-2}\cdots v_3v_2v_1v_0)_i.$$

This implies

$$(x_1x_0x_3x_2\cdots x_{m-1-2i}x_{m-2-2i}) \ne (u_1u_0u_3u_2\cdots u_{m-1-2i}u_{m-2-2i}) \quad\text{or}\quad (y_{2i-1}y_{2i-2}\cdots y_3y_2y_1y_0) \ne (v_{2i-1}v_{2i-2}\cdots v_3v_2v_1v_0).$$

Hence, we have $\varphi(X,U) + \varphi(Y,V) < \frac{m}{2}$.

For the necessary condition (the inequality implies conflict-free connections): we assume $\varphi(X,U) = k$ and $\varphi(Y,V) = q$, with $k + q < \frac{m}{2}$.

(1) If $i = 0$ or $i = \frac{m}{2} - 1$, then

$$(x_1x_0x_3x_2\cdots x_{m-1-2i}x_{m-2-2i}\,y_{2i-1}y_{2i-2}\cdots y_3y_2y_1y_0)_i \ne (u_1u_0u_3u_2\cdots u_{m-1-2i}u_{m-2-2i}\,v_{2i-1}v_{2i-2}\cdots v_3v_2v_1v_0)_i,$$

since $X \ne U$ and $Y \ne V$.

(2) If $1 \le i \le q$, then

$$(y_{2i-1}y_{2i-2}\cdots y_3y_2y_1y_0) = (v_{2i-1}v_{2i-2}\cdots v_3v_2v_1v_0).$$

Since $\varphi(X,U) = k$ and $k < \frac{m}{2} - q \le \frac{m}{2} - i$, we have

$$(x_1x_0x_3x_2\cdots x_{m-1-2i}x_{m-2-2i}) \ne (u_1u_0u_3u_2\cdots u_{m-1-2i}u_{m-2-2i}).$$

(3) If $q < i \le \frac{m}{2} - 1$, then

$$(y_{2i-1}y_{2i-2}\cdots y_3y_2y_1y_0) \ne (v_{2i-1}v_{2i-2}\cdots v_3v_2v_1v_0).$$

Consequently, at every stage $i$,

$$(x_1x_0x_3x_2\cdots x_{m-1-2i}x_{m-2-2i}\,y_{2i-1}y_{2i-2}\cdots y_3y_2y_1y_0)_i \ne (u_1u_0u_3u_2\cdots u_{m-1-2i}u_{m-2-2i}\,v_{2i-1}v_{2i-2}\cdots v_3v_2v_1v_0)_i.$$

Hence, $X \to Y$ and $U \to V$ are conflict-free. □
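The statement of Theorem 3.1 can be checked exhaustively for a small network. The sketch below is a sanity check under stated assumptions, not part of the original proof: it models the link used at stage $i$ as the pair (low $m - 2i$ bits of the source, low $2i$ bits of the destination), which carries the same information as the link expression in the proof, and the names `phi`, `link`, `conflict_free` and `theorem_3_1_holds` are hypothetical.

```python
from itertools import product

def phi(x, u, m):
    """Number of consecutive identical low-order two-bit groups of x and u."""
    count = 0
    for p in range(m // 2):
        if (x >> (2 * p)) & 0b11 != (u >> (2 * p)) & 0b11:
            break
        count += 1
    return count

def link(src, dst, i, m):
    """Label of the link used at stage i (assumed encoding, see lead-in)."""
    src_bits = m - 2 * i          # low-order source bits still present
    dst_bits = 2 * i              # destination bits already consumed
    return (src & ((1 << src_bits) - 1), dst & ((1 << dst_bits) - 1))

def conflict_free(x, y, u, v, m):
    """True if X -> Y and U -> V use different links at every stage."""
    return all(link(x, y, i, m) != link(u, v, i, m) for i in range(m // 2))

def theorem_3_1_holds(m):
    """Compare the link-level test with the phi inequality for all pairs."""
    nodes = range(2 ** m)
    return all(
        conflict_free(x, y, u, v, m) == (phi(x, u, m) + phi(y, v, m) < m // 2)
        for x, u in product(nodes, repeat=2) if x != u
        for y, v in product(nodes, repeat=2) if y != v
    )

checked = theorem_3_1_holds(4)   # 16-node network, all pairs of connections
```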

The operation of sending messages from processors to memories (or vice versa) is called a permutation routing. In what follows, we assume that the processor-to-memory assignment is a permutation.


Definition: A permutation of bit reversal is defined as one in which a binary representation is reversed bit by bit. For example, if we define the permutation function of bit reversal to be $\rho$, an $m$-bit binary number $A = a_{m-1}a_{m-2}a_{m-3}\cdots a_2a_1a_0$ is transformed as follows: $\rho(A) = a_0a_1a_2\cdots a_{m-3}a_{m-2}a_{m-1}$.

Theorem 3.2: The permutation of bit reversal ensures that the communication links are conflict-free when it is used to connect all the source nodes to their respective destination nodes.

Proof: Assume $S_i$ and $S_j$ are the binary representations of any two source nodes,

$$S_i = a_{m-1}a_{m-2}a_{m-3}\cdots a_2a_1a_0,$$

$$S_j = b_{m-1}b_{m-2}b_{m-3}\cdots b_2b_1b_0.$$

The permutation of bit reversal completes two connections, $S_i \to D_i$ and $S_j \to D_j$, where

$$D_i = \rho(S_i) = a_0a_1a_2\cdots a_{m-3}a_{m-2}a_{m-1},$$

$$D_j = \rho(S_j) = b_0b_1b_2\cdots b_{m-3}b_{m-2}b_{m-1}.$$

Since $S_i \ne S_j$, if we assume $\varphi(S_i,S_j) = \omega$, then $0 \le \omega \le \frac{m}{2} - 1$. This allows

(1) $\varphi(D_i,D_j) = 0$, if $\omega \ne 0$ (note that $\max\{\omega\} = \frac{m}{2} - 1$), or

(2) $\max\{\varphi(D_i,D_j)\} = \frac{m}{2} - 1$, if $\omega = 0$.

In either case, $\varphi(S_i,S_j) + \varphi(D_i,D_j) < \frac{m}{2}$. Based on Theorem 3.1, any two of the connections are conflict-free. □
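A brute-force check of Theorem 3.2 for small networks is sketched below. The names `bit_reverse`, `phi` and `bit_reversal_is_conflict_free` are hypothetical, and the check relies on the inequality of Theorem 3.1 rather than on a simulation of the switching elements.

```python
def phi(x, u, m):
    """Number of consecutive identical low-order two-bit groups of x and u."""
    count = 0
    for p in range(m // 2):
        if (x >> (2 * p)) & 0b11 != (u >> (2 * p)) & 0b11:
            break
        count += 1
    return count

def bit_reverse(a, width):
    """Reverse the low `width` bits of a."""
    r = 0
    for _ in range(width):
        r = (r << 1) | (a & 1)
        a >>= 1
    return r

def bit_reversal_is_conflict_free(m):
    """Check phi(S_i, S_j) + phi(D_i, D_j) < m/2 for every pair of sources."""
    n = 2 ** m
    return all(
        phi(s, t, m) + phi(bit_reverse(s, m), bit_reverse(t, m), m) < m // 2
        for s in range(n) for t in range(s + 1, n)
    )

results = [bit_reversal_is_conflict_free(m) for m in (4, 6)]
```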

We now describe the conflict-free connection strategies as follows. In the beginning, the $2^m$ processor space is divided into two subspaces, source-1 and destination-1; each space has $2^{m-1}$ processors. For the processors in the source-1 space, the most significant bit of the binary representation is zero. On the other hand, the processors in the destination-1 space have most significant bit one. The connections between the source-1 and destination-1 spaces are established by performing the permutation of bit reversal on the remaining $m-1$ binary bits of the processors in the source-1 space.

Secondly, the processors in the destination-1 space are now divided into another two subspaces, source-2 and destination-2; each has $2^{m-2}$ processors. For the processors in the source-2 space, the second most significant bit is zero, and the second most significant bit of the processors in the destination-2 space is one. Similarly, the connections between the source-2 and destination-2 spaces are established by performing the permutation of bit reversal on the remaining $m-2$ binary bits of the processors in the source-2 space. Similar connection strategies are continued until the number of processors in the destination-$m$ space is equal to one. Since the permutation of bit reversal ensures a one-to-one connection, the connections established as above also preserve the property of one-to-one permutation.

The above connection strategies can be formulated as follows. First we define a permutation $\rho^h$ which performs partial bit-reduction and partial bit-reversal. For example, if $A = a_{m-1}a_{m-2}\cdots a_{m-h}a_{m-h-1}\cdots a_2a_1a_0$, then

$$\rho^h(A) = a_0a_1a_2\cdots a_{m-h-1}.$$

Suppose we let $S_i^h \to D_i^h$ and $S_j^h \to D_j^h$ represent any two connections from the source-$h$ space to the destination-$h$ space. The $m$-bit binary representations of $S_i^h$, $S_j^h$, $D_i^h$ and $D_j^h$ can be expressed as follows, where $h = 1, 2, 3, \ldots, m-1$:

$$S_i^h = \underbrace{11\cdots1}_{h-1}\,0\,a_{m-h-1}\cdots a_2a_1a_0;$$

$$S_j^h = \underbrace{11\cdots1}_{h-1}\,0\,b_{m-h-1}\cdots b_2b_1b_0;$$

$$D_i^h = \underbrace{11\cdots1}_{h}\,\rho^h(S_i^h) = \underbrace{11\cdots1}_{h}\,a_0a_1a_2\cdots a_{m-h-1};$$

$$D_j^h = \underbrace{11\cdots1}_{h}\,\rho^h(S_j^h) = \underbrace{11\cdots1}_{h}\,b_0b_1b_2\cdots b_{m-h-1}.$$

If $h = m$, only one connection is established. The $m$-bit binary representations of $S^h$ and $D^h$ become $S^m = 1111\cdots10$ and $D^m = 1111\cdots11$.


Theorem 3.3: For every $h$, $h = 1$ to $m$, the connections $S^h \to D^h$ are conflict-free.



Proof: If we let $A^h = a_{m-h-1}\cdots a_2a_1a_0$ and $B^h = b_{m-h-1}\cdots b_2b_1b_0$, then we have $\rho^h(S_i^h) = \rho(A^h)$ and $\rho^h(S_j^h) = \rho(B^h)$. Based on Theorems 3.1 and 3.2, since the permutation of bit reversal ensures that the connections are conflict-free for $h = 1, 2, \ldots, m$, we have

$$\varphi(A^h,B^h) + \varphi(\rho(A^h),\rho(B^h)) < \frac{m}{2}.$$

Also, because $S_i^h \ne S_j^h$ and $D_i^h \ne D_j^h$, the following two equations are valid:

(1) $\varphi(S_i^h,S_j^h) = \varphi(A^h,B^h)$, and

(2) $\varphi(D_i^h,D_j^h) = \varphi(\rho(A^h),\rho(B^h))$.

Hence, we have $\varphi(S_i^h,S_j^h) + \varphi(D_i^h,D_j^h) < \frac{m}{2}$. According to the necessary condition of Theorem 3.1, we know that for every $h$, the connections established as above are all conflict-free. □
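The level-by-level connection strategy can be generated and checked mechanically. The following sketch is illustrative: `level_connections` builds the (source, destination) pairs of level $h$ from the bit patterns given above, and `levels_are_conflict_free` applies the inequality of Theorem 3.1 to every pair of connections within a level. All function names are hypothetical.

```python
def bit_reverse(a, width):
    """Reverse the low `width` bits of a."""
    r = 0
    for _ in range(width):
        r = (r << 1) | (a & 1)
        a >>= 1
    return r

def phi(x, u, m):
    """Number of consecutive identical low-order two-bit groups of x and u."""
    count = 0
    for p in range(m // 2):
        if (x >> (2 * p)) & 0b11 != (u >> (2 * p)) & 0b11:
            break
        count += 1
    return count

def level_connections(m, h):
    """Return the (source, destination) pairs of level h, 1 <= h <= m."""
    free = m - h                    # address bits that still vary
    ones = (1 << h) - 1             # h one bits
    pairs = []
    for a in range(2 ** free):
        src = ((ones >> 1) << (free + 1)) | a         # h-1 ones, a zero, then a
        dst = (ones << free) | bit_reverse(a, free)   # h ones, then a reversed
        pairs.append((src, dst))
    return pairs

def levels_are_conflict_free(m):
    """Apply the Theorem 3.1 inequality to all connection pairs of each level."""
    for h in range(1, m + 1):
        pairs = level_connections(m, h)
        for idx, (s1, d1) in enumerate(pairs):
            for s2, d2 in pairs[idx + 1:]:
                if phi(s1, s2, m) + phi(d1, d2, m) >= m // 2:
                    return False
    return True

level3 = level_connections(4, 3)    # two connections remain at level 3
```

Over all levels, every processor except the all-ones node appears as a source exactly once, which is what makes the connections form a tree rooted at processor $11\cdots1$.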

3.3.2  Reduction  of  Memory  Contentions 

In a shared-memory environment, memory contention frequently incurs extra execution time and consequently degrades system performance. Hence, contention in the shared memories needs to be reduced in addition to minimizing conflicts in the communication links. We propose a tree-shape communication structure which can reduce memory contention from $O(n)$ to $O(\log_2 n)$ (where $n$ is the problem size). The setup of a tree-shape communication structure for searching is described as follows.

A deposit-access mode, which requires only one-path communication cost, is used for processors to access the shared memories in the communication structure. For example, if $P_i$ ($P_k$) is connected to the local memory of $P_j$ ($P_l$) by the conflict-free connection strategies, the deposit-access mode means that $P_i$ ($P_k$) will fetch the compared result from its local memory and store it, through the interconnection network, in the local memory of $P_j$ ($P_l$). $P_j$ ($P_l$) performs an arithmetic operation to compare the result deposited by $P_i$ ($P_k$) with its own result, both of which are stored in the local memory of $P_j$ ($P_l$). After the comparison operation, $P_j$ in turn deposits the compared result, through the interconnection network, in the local memory of $P_l$ with the same procedure. Figure 3.3 shows the tree-shape communication structure built with $P_i$, $P_j$, $P_k$, and $P_l$. In this case, $P_i$ and $P_j$ ($P_k$ and $P_l$) are called pair nodes, and $P_i$ ($P_k$) is referred to as the left child, $P_j$ ($P_l$) as the right child. In our design, the right child will become the father in the next level.

Definition: The number of levels of a tree structure is defined as the distance from the farthest leaf node to the root. Hence, for a tree structure with $2^m$ nodes, the number of levels equals $m$. We let $h = 1, 2, 3, \ldots, m$ represent each level of the tree structure.

The following mapping algorithms serve the purpose of setting up a tree-shape communication structure on the Butterfly™ network. Meanwhile, at each level, conflicts in the communication links are avoided and contention in the shared memories is reduced.

Algorithm 3.1: To compute all the ratios concurrently for a picked pivot column.

FOR $i = 0$ TO $n - 1$ (all processors execute in parallel)

(1) $P_i$ reads data $V_{i,d}$ and $W_i$ from its local memory;

(2) $P_i$ computes the ratio $= \frac{W_i}{V_{i,d}}$;

END; □

The binary representation of processor $P_i$ ($i = 0$ to $n-1$) is assumed to be $a_{m-1}a_{m-2}\cdots a_1a_0$. For a tree level $h$, bit $a_{m-h}$ ($h = 1, 2, \ldots, m$) is an indicator used to divide the processor space into two subspaces, source and destination.

(1) If $a_{m-h} = 0$, then $P_i$ is in the source subspace.

(2) If $a_{m-h} = 1$, then the local memory of $P_i$ (that is, $M_i$) is in the destination subspace.

The source node $P_i$ then connects to its destination by performing the permutation of bit reversal on its remaining binary digits $a_{m-h-1}a_{m-h-2}\cdots a_1a_0$.

Algorithm 3.2: To search for the smallest ratio and the pivot, we build the searching tree structure from the bottom level ($h = 1$) to the top level ($h = m$).

Step 1: For level $h = 1$ on the tree-shape structure

If $a_{m-1} = 0$, then $P_i$ connects to its destination by performing the permutation of bit reversal on its remaining binary digits $a_{m-2}a_{m-3}\cdots a_1a_0$.

Step 2: For level $h = 2$ on the tree-shape structure

The destination nodes in level $h = 1$ are now divided into two subspaces. If $a_{m-2} = 0$, then $P_i$ connects to its destination by performing the permutation of bit reversal on its remaining binary digits $a_{m-3}a_{m-4}\cdots a_1a_0$.

Step 3: Repeat the same procedure until level $h = m$ is reached; the tree-shape communication structure is then set up.

Step 4: The smallest ratio is determined at the root processor, and the pivot is found from the denominator of the smallest ratio. The pivot row can also be decided from the pivot.

Based on Theorem 3.3, the searching tree can be mapped onto the Butterfly™ network without any conflict in the communication links for every level $h$, and the contention in shared memories is reduced to $O(\log_2 n)$. □
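The searching tree can be simulated sequentially to see how the minimum migrates to the root in $m$ levels. The sketch below is an illustration only: processors are list indices, a "deposit" is a plain list update, and the names `bit_reverse` and `tree_search` as well as the sample ratios are hypothetical. It assumes exactly $2^m$ ratios are supplied (unused processors could hold `float("inf")`).

```python
def bit_reverse(a, width):
    """Reverse the low `width` bits of a."""
    r = 0
    for _ in range(width):
        r = (r << 1) | (a & 1)
        a >>= 1
    return r

def tree_search(ratios, m):
    """Return (smallest ratio, owning processor) after m deposit-and-compare levels."""
    n = 2 ** m
    best = [(ratios[i], i) for i in range(n)]   # each processor starts with its own ratio
    for h in range(1, m + 1):
        free = m - h
        ones = (1 << h) - 1
        for a in range(2 ** free):
            src = ((ones >> 1) << (free + 1)) | a         # source-h processor
            dst = (ones << free) | bit_reverse(a, free)   # its destination-h partner
            # src deposits its current result at dst; dst keeps the smaller one.
            best[dst] = min(best[dst], best[src])
    return best[n - 1]                           # the root is processor 11...1

ratios = [7, 3, 9, 4, 8, 2.5, 6, 5, 11, 10, 12, 1.5, 13, 14, 15, 16]
smallest, owner = tree_search(ratios, 4)
```

Each destination receives exactly one deposit per level, which is the source of the $O(\log_2 n)$ contention bound quoted above.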

An example of the communication structure of a searching tree, shown in Figure 3.4 (the communication direction is indicated by the upward arrows), is built with 16 processor nodes ($m = 4$ and $n = 16$). Once the tree-shape communication structure has been set up (and the pivot has been determined), the next computational phase is concerned with broadcasting the pivot from the root processor to all other memory modules so that the other processors can access it directly.

Definition: A memory replication technique is a technique for duplicating a critical data item in as many memory locations as needed by using the deposit-access mode.

The same tree-shape communication structure built in Algorithm 3.2 is used to perform the memory replication by simply reversing the access procedure. The direction of communication is indicated by the downward arrows in Figure 3.4. This tree structure is referred to as a broadcasting tree.

Theorem 3.4: A broadcasting tree structure can be mapped onto the Butterfly™ network with conflict-free connections at each level, on the condition that the connections in a searching tree structure are conflict-free at the same level on the same network.

Proof: Assume any two connections $X \to Y$ and $U \to V$ belong to the connections in the searching tree structure at level $h$. Since they are conflict-free, according to the sufficient condition of Theorem 3.1, we have $\varphi(X,U) + \varphi(Y,V) < \frac{m}{2}$. Now, these two connections in a broadcasting tree structure become $Y \to X$ and $V \to U$. Since the inequality $\varphi(Y,V) + \varphi(X,U) < \frac{m}{2}$ is also true, based on the necessary condition of Theorem 3.1, the connections in a broadcasting tree at level $h$ are all conflict-free. □

Figure 3.4. Communication structure of message search and broadcast through the Butterfly network. (Legend: arrowed lines denote remote memory accesses; plain lines denote local memory accesses.)

Algorithm 3.3: The memory replication technique is used to broadcast the pivot from the root processor to all other processors' local memories. Then the data elements on the pivot column are modified simultaneously. The pivot processor is defined as the one whose stored element $V_{i,d}$ is equal to the pivot. For example, if the pivot is $V_{k,d}$, then $P_k$ is the pivot processor.

Step 1: Memory replication

The broadcasting tree structure is used to replicate the pivot from the root processor to all other processors' local memories. Based on Theorem 3.4, the memory replication can be accomplished with minimal contention in the shared memories and with conflict-free communication links at each level.

Step 2: Now, every processor has received the pivot ($V_{k,d}$).

FOR $i = 0$ TO $n - 1$ (all processors execute in parallel)

IF $i = k$ THEN $P_i$ performs the following operation

$V_{i,d} = \frac{1}{V_{k,d}}$ (modify the pivot);

ELSE $P_i$ performs

$V_{i,d} = -\frac{V_{i,d}}{V_{k,d}}$ (modify other data elements);

END;

Step 3: Pivot processor $P_k$ sends its local data $V_{k,j}$, $j = 0$ to $r$, and $W_k$ to the root processor. With the data flow similar to a wave, the root processor broadcasts these data to the other processors' local memories by using the broadcasting tree.


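Algorithm 3.4: To simultaneously modify the data elements other than those on the pivot column. (Here $V_{k,d}$ and $V_{i,d}$ denote the values held before the pivot-column update of Algorithm 3.3.)

FOR $i = 0$ TO $n - 1$ (all processors execute in parallel)

IF $i = k$ THEN $P_i$ performs

$V_{k,j} = \frac{V_{k,j}}{V_{k,d}}$ ($j = 0$ to $r$, and $j \ne d$); and

$W_k = \frac{W_k}{V_{k,d}}$;

ELSE $P_i$ performs

$V_{i,j} = V_{i,j} - V_{k,j} * \frac{V_{i,d}}{V_{k,d}}$ ($j = 0$ to $r$, and $j \ne d$); and

$W_i = W_i - W_k * \frac{V_{i,d}}{V_{k,d}}$;

END. □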
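One full iteration, organized by the four algorithms, can be sketched as follows. This is a sequential simulation under stated assumptions, not the Butterfly implementation: each row stands for one processor, the tree search and the broadcasts are replaced by ordinary Python operations, rows with $V_{i,d} \le 0$ are skipped in the ratio test, and the function name `parallel_iteration` and the sample problem are hypothetical. The remaining entries are updated with the already modified pivot-column values, so each costs one multiplication and one addition.

```python
from fractions import Fraction

def parallel_iteration(V, W, d):
    """Run one iteration on rows V[i] with right-hand sides W[i]; row 0 is the objective."""
    n = len(V)
    # Algorithm 3.1: every row computes its own ratio from local data.
    ratios = {i: W[i] / V[i][d] for i in range(1, n) if V[i][d] > 0}
    # Algorithm 3.2: the smallest ratio identifies the pivot row k.
    k = min(ratios, key=ratios.get)
    pivot = V[k][d]
    pivot_row, pivot_w = list(V[k]), W[k]    # data broadcast in Algorithm 3.3, Step 3
    # Algorithm 3.3, Step 2: update the pivot column.
    for i in range(n):
        V[i][d] = 1 / pivot if i == k else -V[i][d] / pivot
    # Algorithm 3.4: update every remaining entry of every row.
    for i in range(n):
        for j in range(len(V[i])):
            if j == d:
                continue
            if i == k:
                V[i][j] = pivot_row[j] * V[k][d]
            else:
                V[i][j] += pivot_row[j] * V[i][d]
        W[i] = pivot_w * V[k][d] if i == k else W[i] + pivot_w * V[i][d]
    return k

# Hypothetical example: maximize 3*X0 + 5*X1 with X0 <= 4, 2*X1 <= 12, 3*X0 + 2*X1 <= 18.
V = [[Fraction(v) for v in row] for row in [[-3, -5], [1, 0], [0, 2], [3, 2]]]
W = [Fraction(v) for v in (0, 4, 12, 18)]
while any(c < 0 for c in V[0]):
    d = next(j for j, c in enumerate(V[0]) if c < 0)   # next negative objective coefficient
    parallel_iteration(V, W, d)
```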

As soon as the first iteration is completed, the second pivot column is selected according to the next negative coefficient of the objective equation. This begins the second iteration with the same procedures from Algorithm 3.1 to Algorithm 3.4. The number of iterations is equal to the number of negative coefficients of the objective equation.

3.4.  Performance  Evaluation 

It is widely known that performance analysis of an iterative algorithm on an MIMD multiprocessor is a very complex and difficult job, since many factors jointly determine algorithm performance and the modification of a certain factor may affect others. For simplicity, we make a few assumptions in an attempt to approximately predict the system performance by complexity analysis. Note that the real system performance should be better than the following analysis, since we consider the worst case. It is assumed that execution of identical arithmetic operations on different processor nodes requires the same response time. In the following discussion, $\alpha$ denotes the time for completing one multiplication, $\beta$ the time for completing one division, $\eta$ the time for completing one addition, $\sigma$ the time for completing one logic comparison operation, and $\rho$ the time for completing one remote memory access in deposit mode. We also assume that the time required for a local memory access is so small that it can be neglected. Let $T(i)$ represent the execution time of Algorithm 3.$i$ in one iteration, $T_{para}$ represent the execution time of the parallel algorithms on the Butterfly™ multiprocessor in one iteration, and $T_{uni}$ represent the execution time of the sequential algorithm on a uniprocessor in one iteration. We derive the following three inequalities:

$$T_{para} \le \sum_{i=1}^{4} T(i) = \beta + (\rho + \sigma)\log_2 n + \bigl(\rho\log_2 n + \beta + \rho(r+2)(1 + \log_2 n)\bigr) + (r+1)(\alpha + \eta); \tag{3.1}$$

$$T_{uni} \ge n\beta + (n-1)\sigma + n\beta + (r+2)\beta + n(r+2)(\alpha + \eta); \tag{3.2}$$

$$\text{Speedup} = \frac{\sum_{k=1}^{r+1} T_{uni}}{\sum_{k=1}^{r+1} T_{para}} \ge \frac{\sum_{k=1}^{r+1}\bigl(n\beta + (n-1)\sigma + n\beta + (r+2)\beta + n(r+2)(\alpha + \eta)\bigr)}{\sum_{k=1}^{r+1}\bigl(\beta + (\rho + \sigma)\log_2 n + (\beta + \rho(r+2) + \rho(r+3)\log_2 n) + (r+1)(\alpha + \eta)\bigr)},$$

where the maximum number of iterations is equal to $r + 1$. If we assume the iterative parallel algorithm is homogeneous, then the time for completing each iteration is almost the same. When $r$ approaches $n$, we have

$$\text{speedup} \ge \frac{O(n^2)}{O(n\log_2 n)}.$$
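The bounds (3.1) and (3.2) are easy to evaluate numerically. The sketch below does so for a few problem sizes; the function names and the unit operation times (`alpha=1, beta=2, eta=1, sigma=1, rho=4`) are placeholders chosen for illustration and are not measured Butterfly figures.

```python
from math import log2

def t_para(n, r, alpha, beta, eta, sigma, rho):
    """Right-hand side of (3.1): worst-case time of one parallel iteration."""
    lg = log2(n)
    return (beta + (rho + sigma) * lg
            + (rho * lg + beta + rho * (r + 2) * (1 + lg))
            + (r + 1) * (alpha + eta))

def t_uni(n, r, alpha, beta, eta, sigma, rho):
    """Right-hand side of (3.2): time of one sequential iteration (rho is unused)."""
    return (n * beta + (n - 1) * sigma + n * beta + (r + 2) * beta
            + n * (r + 2) * (alpha + eta))

def speedup_bound(n, r, **times):
    """Per-iteration ratio; the sums over r + 1 equal iterations cancel."""
    return t_uni(n, r, **times) / t_para(n, r, **times)

times = dict(alpha=1, beta=2, eta=1, sigma=1, rho=4)    # hypothetical time units
bounds = [speedup_bound(n, n, **times) for n in (16, 64, 256)]
```

With these placeholder values the bound increases with $n$ when $r = n$, in line with the $O(n^2)/O(n\log_2 n)$ estimate.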


The speedup bound indicates that the speedup is an almost first-order increasing function of the problem size $n$, growing as $n/\log_2 n$ once $n$ becomes reasonably large. The result also supports our claim that the parallel algorithms can achieve higher system performance by taking advantage of the mapping method.

CHAPTER 4


EVALUATION  OF  EXISTING  DEPENDABILITY  TOOLS 

It was pointed out in the introduction that there exists no tool that can be used directly for exact dependability evaluation. By exact, we mean a Markovian or semi-Markovian approach to capture the component failure and repair processes accurately. Although there have been some attempts to model a MIN-based multiprocessor using Markovian techniques [Arlat 83, Blake 87], these models are very simple and are also not easily extendable to large systems. In the absence of an exact analytical model, an approximate model is preferred if it can give acceptable results. Moreover, in the process of developing an approximate model one can gain better insight for pursuing the exact technique. In sections 5 and 6, we present approximate techniques for the dependability evaluation of Butterfly and hypercube systems. In this section we first present a brief summary of some of the existing tools, highlighting their limitations for the dependability evaluation of candidate parallel computers.

There are a number of existing tools available for computing the dependability of redundant systems. Tools such as ARIES, CARE III, HARP, and SHARPE can be used for reliability analysis, whereas tools such as HARP, SHARPE, and SAVE can be used for both reliability and availability analysis. SHARPE, on the other hand, can be used for performability evaluation. In this section a brief summary of CARE III, HARP, and SHARPE is given. The conclusions regarding the applicability of these models to candidate architectures also apply to other tools not summarized here.

4.1  CARE  III 

CARE III (Computer Aided Reliability Estimation, Version Three) is a program designed to estimate the reliability of complex redundant systems [Stiffler 82]. It was developed specifically for fault-tolerant avionics systems. CARE III features are summarized below.

Capabilities 

Predicts the unreliability (1 - reliability) of a system consisting of up to 70 stages, with each stage composed of one or more identical modules.

Can handle hardware/software faults of various types, such as permanent, transient, and intermittent.

User must specify the number of modules in each stage, the minimum number of modules needed in each stage for the system to operate properly, the various combinations of stage failures that constitute a system failure, and the probability that a specific module from stage i forms a critical pair (system failure) with a specific module from stage j. Hence, a system tree specification involving the critical pairs must be given as input to the program. The lower level faults in the fault tree specification are stage failures.

Models imperfect fault handling (coverage) using Markovian techniques.

Fault  distribution  is  given  by  a  Weibull  function. 

Disadvantages 

Cannot model availability.

Specifying a fault tree in terms of critical pairs for a MIN-based system or hypercube is very difficult. The number of critical failure combinations can be too large to specify for a medium or large size system. In particular, the various combinations of switch failures that can lead to system failure in a Butterfly type system are extremely difficult to specify.

Cannot model performance-related dependability.

4.2 HARP

HARP (Hybrid Automated Reliability Predictor) [Bavuso 87, Geist 83] is a software
package that implements dependability modeling techniques. Its capabilities and
disadvantages are given below.

Capabilities 


Can compute both reliability and transient availability of computer systems using
behavioral decomposition along temporal lines. The overall model is decomposed
into fault-occurrence/repair (FORM) and fault/error handling (FEHM) submodules
to analyze the fault-occurrence and coverage effects effectively.

Can handle various types of faults, as described for CARE III.

The user must input either the Markov chain of the system or a Petri-net model, which
can be converted to a Markov chain automatically for computing dependability.
Alternatively, the input can be a fault tree specification of the system.

Can model systems with sequence dependent failures.

Gives  guaranteed  bounds  on  reliability. 

Weibull  distribution  for  reliability  modeling. 

Disadvantages 

Cannot  compute  MTTF  or  steady  state  behavior  for  repairable  systems. 

Cannot generate the Markov chain automatically. As has been pointed out
earlier, generation of the Markov chain is complex for systems like the Butterfly or
hypercube. Also, a fault-tree specification of the candidate parallel system is not
simple. Hence, the difficulty of finding the input model restricts the usefulness of
HARP for the parallel architectures under consideration.

4.3  SHARPE 

SHARPE (Symbolic Hierarchical Automated Reliability and Performance Evaluator)
is currently under development at Duke University [Shaner 87]. In addition to
dependability evaluation, it can combine performance with dependability, as in
performability analysis. Its features are summarized below.

Capabilities 

Supports  seven  model  types  such  as  reliability  block  diagram,  fault  tree  without 
repeated  nodes,  acyclic  Markov  chains  and  irreducible  cyclic  Markov  chains  to  be 
combined  hierarchically  in  a  flexible  manner. 

Allows the use of either combinatorial or Markov/semi-Markov submodules.

Uses symbolic computation.

Input to the model is in the form of a reliability block diagram, fault tree, or
Markov chain.

Disadvantages 

As with HARP, construction of the fault tree or Markov chain is again the
challenging problem. Developing a reliability block diagram that models task based
evaluation is also not simple.

CHAPTER 5

BUTTERFLY  DEPENDABILITY  MODELING 


We have developed a preliminary analytical model for computing the reliability
of Butterfly network based multiprocessors [Tien 88]. The model is preliminary in
the sense that the actual Butterfly node failure/repair behavior is not included in the
model to address the coverage accurately. Also, it does not address details of a
Butterfly system such as the extra stage of switches, software failures, and system sizes
that are not powers of four. It captures graceful degradation of a Butterfly system
by considering the failure of processors, memories, and the 4x4 switching elements (SEs)
that constitute the Butterfly network.

The modeling approach is based on system decomposition. Since the Butterfly
system uses 4x4 switches, the system size is generally given by $4^l$. Although systems
available today can be configured with any number of nodes n, $n \le 256$, the
configurations that are not powers of 4 do not use all the $(N/4)\log_4 N$ switches used for
the interconnection network. Hence, this model addresses $4^l$ systems first. Extension
of the model to other configurations, where $N \ne 4^l$, is under investigation. The
modeling approach is based on system decomposition and combinatorial techniques.
The reliability of a $4^l$ system is obtained from four $4^{l-1}$ subsystems and the
connection pattern between those subsystems. The reliability model assumes a homogeneous
multiprocessor system. The PEs, MMs, and SEs are homogeneous and have identical
exponential failure distributions. We define $\lambda_p$, $\lambda_m$, and $\lambda_s$ as the failure rates of a
PE, MM, and SE, respectively. The corresponding component reliabilities are given by
$R_p(t) = e^{-\lambda_p t}$, $R_m(t) = e^{-\lambda_m t}$, and $R_{se}(t) = e^{-\lambda_s t}$. Task based reliability is computed
by looking for a connected system with at least I processors and J memories.
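
The component reliabilities can be evaluated numerically as in the short Python sketch below. It assumes the exponential law stated above and borrows the failure rates quoted later in this chapter for the 16x16 study. The mission time `t` and the function name are illustrative assumptions, and the time unit is whatever unit the failure rates are expressed in, which the report does not state.

```python
import math

def component_reliability(failure_rate, t):
    """Reliability at time t of a component with an exponential failure law."""
    return math.exp(-failure_rate * t)

# Failure rates quoted in this chapter for the PE, MM, and SE.
LAMBDA_P, LAMBDA_M, LAMBDA_S = 1e-4, 1e-4, 2e-5

t = 1000.0  # hypothetical mission time (assumption, unit unspecified)
r_p = component_reliability(LAMBDA_P, t)
r_m = component_reliability(LAMBDA_M, t)
r_se = component_reliability(LAMBDA_S, t)
print(r_p, r_m, r_se)
```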


5.1  16x16  System  Reliability 


A 16x16 multiprocessor with 8 switches is shown in Figure 5.1. We will call it a
16 node system since a PE and an MM are assembled on a single board. The 16 node
system can be decomposed into four 4x4 subsystems while keeping the communication
between the subsystems undisturbed. A subsystem with 4 processors, 4 memories,
and 2 switches is shown in Figure 5.2. While a 4x4 configuration requires only one SE,
Figure 5.2 uses 2 SEs. One switch connects the 4 PEs, and the second switch connects
the 4 memories. We call these the input and output switches respectively. A system
with more than 4 PEs and 4 MMs needs at least one input SE and one output SE to
establish the connection.


5.1.1  4x4  Analysis 


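We first compute the probability of having exactly i PEs and j MMs connected
at time t, for $i, j \le 4$, in Figure 5.2. This is given by

$$P_{4(i,j)}(t) = I_{\{i=0 \wedge j=0\}} \bigl(1 - R_{se}^{2}(t)\bigr) + \binom{4}{i} R_p^{i}(t) \bigl(1 - R_p(t)\bigr)^{4-i} \binom{4}{j} R_m^{j}(t) \bigl(1 - R_m(t)\bigr)^{4-j} R_{se}^{2}(t) \qquad (5.1)$$

The subscript 4(i,j) stands for selecting i PEs and j MMs from a 4x4 or 4 node
system. $I_{\{i=0 \wedge j=0\}}$ is an indicator function given by

$$I_{\{i=0 \wedge j=0\}} = \begin{cases} 1 & \text{if } i = 0 \wedge j = 0 \\ 0 & \text{if } i \ne 0 \vee j \ne 0 \end{cases}$$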


The first term in equation (5.1) represents the situation where any number of
processors and memories are working when at least one SE has failed. This term
contributes to the $P_{4(0,0)}$ probability. The second term in equation (5.1) denotes the
connection of i working processors with j working memories when both SEs are
fault free. For example, the probability of 2 processors connected to 3 memories is
given by

$$P_{4(2,3)}(t) = \binom{4}{2} R_p^{2}(t) \bigl(1 - R_p(t)\bigr)^{2} \binom{4}{3} R_m^{3}(t) \bigl(1 - R_m(t)\bigr) R_{se}^{2}(t) \qquad (5.2)$$

It should be observed that $R_{se}^{2}(t)$ is included for terms like $P_{4(0,j)}(t)$ or $P_{4(i,0)}(t)$.
This is because the input and output SEs are utilized for any reference to be satisfied.
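
As a check on the 4x4 analysis, the following Python sketch evaluates the probability of exactly i PEs and j MMs being connected. It assumes equation (5.1) has the form described in the text: an indicator term weighted by the probability that at least one of the two SEs has failed, plus a product of two binomial terms weighted by the probability that both SEs work. The function name `p4` and the numerical reliabilities are illustrative assumptions, not values from the report.

```python
from math import comb

def p4(i, j, r_p, r_m, r_se):
    """Probability of exactly i PEs and j MMs connected in a 4x4 group."""
    both_ses_up = r_se ** 2
    # Second term: i of 4 PEs and j of 4 MMs work, and both SEs are fault free.
    connected = (comb(4, i) * r_p**i * (1 - r_p)**(4 - i)
                 * comb(4, j) * r_m**j * (1 - r_m)**(4 - j)
                 * both_ses_up)
    # First term: with at least one SE failed, nothing is connected,
    # so that probability mass goes to the (0, 0) state only.
    isolated = (1 - both_ses_up) if (i == 0 and j == 0) else 0.0
    return isolated + connected

# Hypothetical component reliabilities at some time t (assumptions).
r_p, r_m, r_se = 0.90, 0.90, 0.98

example = p4(2, 3, r_p, r_m, r_se)  # 2 PEs connected to 3 MMs
total = sum(p4(i, j, r_p, r_m, r_se) for i in range(5) for j in range(5))
print(example, total)
```

Summing over all (i, j) gives 1 up to rounding, which is a useful sanity check that the two terms together cover every state of the subsystem.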


5.1.2  16x16  Analysis 


The  reliability  of  a  16x16  system  is  derived  from  the  basic  4x4  model.  Since  the  16 
processors  and  16  memories  can  be  divided  into  four  groups  each  with  their  associated 
SEs,  we  need  to  distribute  the  required  number  of  components  among  4  groups.  For 
example,  all  possible  distributions  of  i  PEs  within  four  processor  groups  and  j  MMs 
within four memory groups must be considered. The probability of selecting i PEs
and j MMs (ixj) from a 16x16 system at time t is given by

$$P_{16(i,j)}(t) = \sum_{i_0=0}^{k_0} \sum_{i_1=0}^{k_1} \sum_{i_2=0}^{k_2} \sum_{j_0=0}^{l_0} \sum_{j_1=0}^{l_1} \sum_{j_2=0}^{l_2} P_{4(i_0,j_0)}(t)\, P_{4(i_1,j_1)}(t)\, P_{4(i_2,j_2)}(t)\, P_{4(i_3,j_3)}(t) \qquad (5.3)$$

where

$k_0 = \min(4, i)$

$k_1 = \min(4, i - i_0)$

$k_2 = \min(4, i - i_0 - i_1)$

$i_3 = i - (i_0 + i_1 + i_2)$

$l_0 = \min(4, j)$

$l_1 = \min(4, j - j_0)$

$l_2 = \min(4, j - j_0 - j_1)$

$j_3 = j - (j_0 + j_1 + j_2)$

The distribution of i PEs among 4 groups is such that $i_0 + i_1 + i_2 + i_3 = i$ and is
controlled by the last term $i_3$. Since $\binom{4}{i_3} = 0$ for $i_3 > 4$, all possible valid distributions
of i PEs among four processor groups are generated by the first three summation
expressions. The $k_x$'s limit the number of PEs in a group to 4. The last
three summation terms, with the corresponding $l_y$'s, generate all the distributions of j MMs
among four groups.

Any valid distribution of the processors and memories is combined into four
processor memory pairs. The $P_{4(i_x,j_y)}$ ($0 \le x, y \le 3$) is the same as in equation (5.1).
It should be observed that by including the input and output SEs in a 4x4 group, we
guarantee connection among the i PEs and j MMs from the four groups. For example,
a $P_{4(i_x,0)}$ group can access a $P_{4(0,j_y)}$ group, as the required four SE reliabilities are
included in the expression.

The reliability of a 16x16 multiprocessor with at least I PEs and J MMs working
and connected is then given by

$$R_{s(N,N)}(t) = \sum_{i=I}^{N} \sum_{j=J}^{N} P_{N(i,j)}(t) \qquad (5.4)$$

where N is the size of the system. In this case N = 16. The reliability variation is plotted
in Figure 5.3. The results are given for PE failure rate $\lambda_p = 0.0001$, MM failure rate
$\lambda_m = 0.0001$, and SE failure rate $\lambda_s = 0.00002$. We have assumed perfect coverage in
this study. However, coverage parameters for the PEs, MMs, and SEs can be included
in the model directly. The solid lines express the analytical results. The model is
validated by plotting the simulation results, shown by dotted lines. It can be observed
that system reliability increases by allowing graceful degradation.
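
The 16x16 computation can be sketched in Python as follows. This is an illustrative reading of equations (5.1), (5.3), and (5.4), not the program used for Figure 5.3. Instead of six nested summations with min() limits, the sketch lists every way of placing i PEs (and j MMs) into four groups of at most four, which generates the same set of valid distributions. The names `p4`, `splits`, `p16`, and `reliability16`, and the mission time `t`, are assumptions made for the example.

```python
from itertools import product
from math import comb, exp

def p4(i, j, r_p, r_m, r_se):
    """Exactly i PEs and j MMs connected in one 4x4 group (assumed eq. 5.1)."""
    both_up = r_se ** 2
    term = (comb(4, i) * r_p**i * (1 - r_p)**(4 - i)
            * comb(4, j) * r_m**j * (1 - r_m)**(4 - j) * both_up)
    return term + ((1 - both_up) if i == 0 and j == 0 else 0.0)

def splits(total, groups=4, cap=4):
    """All ordered ways to place `total` units in `groups` groups of size <= cap."""
    return [s for s in product(range(cap + 1), repeat=groups) if sum(s) == total]

def p16(i, j, r_p, r_m, r_se):
    """Exactly i PEs and j MMs connected in a 16x16 system (assumed eq. 5.3)."""
    table = {(a, b): p4(a, b, r_p, r_m, r_se) for a in range(5) for b in range(5)}
    total = 0.0
    for pe in splits(i):
        for mm in splits(j):
            term = 1.0
            for a, b in zip(pe, mm):  # pair processor group x with memory group x
                term *= table[(a, b)]
            total += term
    return total

def reliability16(I, J, r_p, r_m, r_se):
    """At least I PEs and J MMs connected (assumed eq. 5.4 with N = 16)."""
    return sum(p16(i, j, r_p, r_m, r_se)
               for i in range(I, 17) for j in range(J, 17))

t = 1000.0  # hypothetical mission time; the report does not state the time unit
r_p, r_m, r_se = exp(-1e-4 * t), exp(-1e-4 * t), exp(-2e-5 * t)
r_full = reliability16(16, 16, r_p, r_m, r_se)
r_degraded = reliability16(12, 12, r_p, r_m, r_se)
print(r_full, r_degraded)
```

Requiring only 12 of 16 PEs and MMs can never give a lower value than requiring all 16, which mirrors the observation that graceful degradation increases reliability.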

5.2  64x64  System  Reliability 

The system size grows in powers of 4 when 4x4 switches are used for the network.
Therefore, a $4^l \times 4^l$ system can always be decomposed into four $4^{l-1} \times 4^{l-1}$ systems without
disturbing connections. Figure 5.4 shows the decomposition of a 64x64 architecture
into four 16x16 groups. A 64x64 configuration has three stages with 16 switches in
each stage. As mentioned in the previous section, the input stage switches (stage 0)
are included with the processors, and the output stage switches are included with
the memories. Hence, each group of 16 PEs ($PG_x$), or 16 MMs ($MG_y$), has four
SEs associated with it. We represent a processor-memory group by ($PG_x$, $MG_y$) for
$0 \le x, y \le 3$.


The distribution of i PEs and j MMs between the four groups can be done in
the same way as for the 16x16 system. However, the middle stage switches (stage 1)
control the access between the various processor-memory groups. It can be observed
from Figure 5.4 that $PG_x$ ($0 \le x \le 3$) can access $MG_x$ using only one of the stage 1
switches, whereas $PG_x$ uses two switches in stage 1 for a round trip communication
with $MG_y$ when $x \ne y$. We number the stage 1 switches by a 2-tuple notation $S_{xy}$. The
first number, x, represents the processor group, and the second number, y, represents
the memory group for which a switch is used. For example, switch (10) is used for
a request from $PG_1$ to $MG_0$. The round trip path is established through switch
(01). The connection between the various processor-memory groups is represented by a
switching node table given in Figure 5.5. It should be observed that the upper and
lower triangular entries in the table are exactly the same.

The switching node table is used to calculate the number of stage 1 switches required
for connection between a group of processors and memories. For example, if
$PG_0$ and $PG_1$ need all four memory groups, 12 SEs are required. This is because
SEs (01) and (10) are common to both groups. These two switches should be included
only once when calculating the total number of SEs required to establish connection.

5.2.1  Processor  Memory  Distribution 

There are two different ways a connected group of i PEs and j MMs can be
available on the system. The first is the case where exactly i PEs and j MMs are
working, and at least the required number of stage 1 SEs are perfect for providing
connection between any PE and MM. In the second situation, more than the required
number of processors and/or memories may be working on the system, but the total
connectivity is (ixj). This is possible when the number of working stage 1 switches
is just sufficient to provide a connectivity of (ixj). Both cases are analyzed in detail
below.


5.2.2  Exactly  (ixj)  elements  working 

Since the i PEs and j MMs can be distributed in up to four groups, the first step
in computing this probability is to find the number of stage 1 switches required for
connection. Let $N_c$ represent the number of stage 1 switches required to connect i
PEs and j MMs. $N_c$ is given by

$$N_c = \sum_{x=0}^{3} \sum_{y=0}^{3} N_{xy} \,\big|\, (PG_x \vee MG_y \ne 0 \ \&\ N_{xy} \text{ not included}) \qquad (5.5)$$

where $N_{xy}$ is the number of SEs required to connect $PG_x$ to $MG_y$. $N_{xy}$ is obtained
from the switching node table of Figure 5.5. $N_{xy} = 2$ when $x \ne y$, and $N_{xy} = 1$ when
x = y. It should be observed that equation (5.5) is a conditional expression. The first
condition says that if there are no working elements either from a processor or memory
group, then $N_{xy} = 0$. The second condition ensures that the same switch is not
included twice. For example, the ($PG_0$, $MG_1$) connection and the ($PG_1$, $MG_0$) connection
need the same two switches (01) and (10). Therefore, $N_{10}$ should not be included in
$N_c$, as $N_{01}$ is already included. The $N_c$ calculation is illustrated below by an example.
Example-1

Let the i PEs be distributed in groups $PG_0$, $PG_1$, and $PG_2$. The j MMs are selected
from $MG_1$, $MG_2$, and $MG_3$. Then

$$N_c = N_{00} + N_{01} + N_{02} + N_{03} + N_{10} + N_{11} + N_{12} + N_{13} + N_{20} + N_{21} + N_{22} + N_{23}$$
$$= 0 + 2 + 2 + 2 + 0 + 1 + 2 + 2 + 0 + 0 + 1 + 2 = 14$$
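
The counting rule for the required stage 1 switches can be sketched with a set, which removes duplicate switches automatically. The sketch assumes, as the text describes, that a connection between processor group x and memory group y uses switch (xy) for the request and switch (yx) for the return path, so that a single switch is used when x = y. The function name is an assumption made for the example.

```python
def required_stage1_switches(pe_groups, mm_groups):
    """Set of stage 1 switches (x, y) needed to connect the given groups.

    pe_groups: indices of processor groups holding at least one working PE.
    mm_groups: indices of memory groups holding at least one working MM.
    """
    needed = set()
    for x in pe_groups:
        for y in mm_groups:
            needed.add((x, y))  # request path from PG_x to MG_y
            needed.add((y, x))  # return path; the same switch when x == y
    return needed

# Example-1: PEs in PG0, PG1, PG2 and MMs in MG1, MG2, MG3.
n_c = len(required_stage1_switches({0, 1, 2}, {1, 2, 3}))
print(n_c)
```

The same function reproduces the earlier count of 12 SEs when processor groups 0 and 1 need all four memory groups.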

As exactly $N_c$ SEs are required to connect i PEs and j MMs, the stage 1 SEs are
now divided into two groups. The first group is the required number of SEs, $N_c$. The
second group is the additional SEs ($16 - N_c$). The state of the additional switches does
not affect the working group (ixj). Therefore, the probability of i PEs and j MMs
working at time t is given by

$$P_{64(i,j)}(t) = \sum_{i_0=0}^{k_0} \sum_{i_1=0}^{k_1} \sum_{i_2=0}^{k_2} \sum_{j_0=0}^{l_0} \sum_{j_1=0}^{l_1} \sum_{j_2=0}^{l_2} P_{16(i_0,j_0)}(t)\, P_{16(i_1,j_1)}(t)\, P_{16(i_2,j_2)}(t)\, P_{16(i_3,j_3)}(t)\, R_{se}(t)^{N_c} \qquad (5.6)$$

Equation (5.6) is identical to equation (5.3) except that the subgroup sizes are
16 instead of 4. Evaluation of the term $P_{16(i_x,j_y)}(t)$, for $0 \le x, y \le 3$, is done using
equation (5.3).

5.2.3 More than (ixj) elements working

This is the situation where the number of connected processors and memories is
limited by the failure of the stage 1 switching elements. We illustrate this situation
by an example.

Example-2 

Consider the distribution $P_{16(16,16)}(t)$, $P_{16(16,16)}(t)$, $P_{16(16,0)}(t)$, and $P_{16(0,0)}(t)$.
All the PEs and MMs from group 0 and group 1 are working. Group 2 has only
16 PEs working but memory connection is zero. Group 3 has all elements 0. The
system size is given by (48x32). Let $N_P$ be the number of (0, $MG_x$) groups and $N_M$
be the number of ($PG_x$, 0) groups in the distribution. For the above case $N_P = 0$ and
$N_M = 1$. Now, as long as $S_{22}$ has failed, the number of MMs working in the third
group is immaterial. $PG_2$ and $MG_2$ are disconnected when $S_{22}$ has failed. $PG_2$ and
$MG_2$ are individually connected to the first and second group through $S_{20}$, $S_{02}$ and
$S_{21}$, $S_{12}$, but not connected as a ($PG_2$, $MG_2$) group. Hence, the system size remains
the same, (48x32), with working MMs in group 2.

When at least one of the SEs from each group $N_{30}$, $N_{31}$, and $N_{32}$ has failed,
the number of working processors and memories from group 3 does not increase the
system size from (48x32). In other words, the failure of at least one SE from $N_{xy}$
disconnects processor group x from memory group y. The above two distributions,
(0,0) and ($PG_x$, 0) or (0, $MG_x$), are combined to give the maximum number of switches
$N_f$ that can fail to disconnect the failed groups from the rest of the system. $N_f$ is
expressed as

$$N_f = N'_{f1} + N_{f2} \qquad (5.7)$$

The first term $N'_{f1}$ gives the maximum number of SEs that can fail to disconnect a
(0,0) group from the rest of the system. The second term $N_{f2}$ denotes the number of
SEs that must fail to keep a group size ($PG_x$, 0) or (0, $MG_x$) even though there is at
least a memory or processor working in the null groups respectively. Let $N_{f1}$ denote
the minimum number of SEs that must fail to disconnect a (0,0) group from the rest
of the system. $N'_{f1}$, $N_{f1}$, and $N_{f2}$ can be expressed as

$$N'_{f1} = \sum_{x=0}^{3} \sum_{y=0}^{3} \{ N_{xy} \mid (PG_x \wedge MG_x = 0),\ (x \ne y),\ \text{not included} \} \qquad (5.8.a)$$

$$N_{f1} = \frac{N'_{f1}}{2} - 1 \qquad (5.8.b)$$

and 


$$N_{f2} = \{ I_{\{N_P > N_M\}} ((N_P - 1) + N_M) + I_{\{N_P < N_M\}} ((N_M - 1) + N_P) \} + \sum_{x=0}^{3} \{ N_{xx} \mid (PG_x \vee MG_x = 0),\ (PG_x \wedge MG_x \ne 0),\ \text{not included} \} \qquad (5.8.c)$$

Both these terms are conditional to avoid the inclusion of the same switches twice.
The first term, $N'_{f1}$, counts the total number of SEs that disconnect a (0,0) group from
the working groups of the system. It does not include the SE $N_{xx}$. The third term,
$N_{f2}$, counts only the $N_{xx}$ switches for a ($PG_x$, 0) or (0, $MG_x$) group. The first term in
(5.8.c) counts the switches that should fail between more than one ($PG_x$, 0) or (0, $MG_x$)
groups. The indicator functions $I_{\{N_P > N_M\}}$ and $I_{\{N_P < N_M\}}$ are used to select the proper
switches. The second term in (5.8.c) counts the $N_{xx}$ switches for a ($PG_x$, 0) or (0, $MG_x$)
group. The third term does not include $N_{xx}$ for a (0,0) group. On the other hand,
the minimum number of SEs that must fail to keep the system size (ixj) is given by
$N_{fm} = (N_{f1} + N_{f2})$. This is because 1 out of each 2 SEs in $N_{xy}$ ($x \ne y$) is sufficient
for disconnection of a (0,0) group. Also 2 out of 3 group switches are sufficient to
disconnect a (0,0) group. For the above example, $N_{f1} = 2$, $N_{f2} = 1$, and $N_{fm} = 3$.
To keep the SE failure model simple, we consider only the minimum number of SEs
required to disconnect the groups. Since 1 out of 2 SEs for each $N_{xy}$ is required for a
(0,0) group, the total number of ways the $N_{fm}$ can be selected is given by

For the above example $X = \binom{3}{2} \times 2^2$.

The situation where more than i PEs and j MMs are working but the total
connectivity is (ixj) is then given by

$$P_{64(i,j)}(t) = \sum \sum P'_{16(i_0,j_0)}(t)\, P'_{16(i_1,j_1)}(t) \cdots \qquad (5.10)$$


$R_{se}(t)^{N_c}$ in equation (5.10) gives the probability that the required number of SEs $N_c$
are working to keep the connectivity (ixj). $(1 - R_{se}(t))^{N_{fm}} X$ represents the probability
that the minimum number of SEs has failed so that the connectivity is (ixj) while
there are actually more than i PEs and/or j MMs working in the system. The term
$P'_{16(i_x,j_x)}(t)$ in equation (5.10) stands for

(5.11)

if $i_x \ne 0$ and $j_x = 0$

if $i_x = j_x = 0$


Evaluation of the term $P'_{16(i_x,j_x)}(t)$ depends on the distribution. When neither a
processor nor a memory group is zero in group x, $P_{16(i_x,j_x)}(t)$ is the same as equation
(5.3). When either the processor or memory group is zero, the corresponding
probabilities are added for $1 \le i \le 16$ to compute $P'_{16(i_x,j_x)}(t)$. The $P_{16(0,0)}(t)$ is computed
by the fourth term of equation (5.11). It should be observed that the minimum value
of x is 1 for the summations in equation (5.11), since all failed element probabilities
are included in equation (5.6).

5.2.4  Reliability  Computation 


The reliability of a 64x64 system can be computed by combining equations (5.6)
and (5.10). It should be observed that all valid working groups are generated by the
above two equations. The only probabilities that are not included are in equation
(5.10), where more than $N_{fm}$ SEs can also fail while keeping the system size (ixj).
However, the contribution of this expression is negligible compared to equation (5.6).
This is mostly because of the term $(1 - R_{se}(t))^{N_{fm}}$. When we take more than the
minimum number, $N_{fm}$, this probability decreases even faster. This argument is valid
when the required system size is about 50% of the original size. With i and j less than
32, the contribution from equation (5.10) is about 10%.

The computation of equation (5.10) is very costly in terms of time. Equations (5.6)
and (5.10) both generate all the distributions and compute $N_c$. In addition, equation
(5.10) generates $N_{fm}$, and whenever there is a PE and/or MM group 0, it computes
one or more of the last three expressions of equation (5.11). When the (ixj) size
is close to (NxN), the possibility of an ($i_x \vee j_y = 0$) is negligible. With lower (ixj)
values the probability of finding a null processor and/or memory group increases.

It can be observed that both equations (5.6) and (5.10) are combinatorial
expressions. All possible distributions of (ixj) are generated in these equations. Probability
computation for all of these combinations is time consuming. However, the equations
can be evaluated efficiently by avoiding the regeneration of similar distributions.
For example, consider a system of size (48x48). Four possible distributions are:

Processor    Memory

All of these combinations have the same probability, since $N_c$ is 9 for all four cases.
Hence, we need compute only one of these terms. By avoiding the recomputation of
similar distributions, the computation of reliability becomes faster.
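
One way to avoid recomputing similar distributions is to give each distribution a key that is identical whenever the resulting probability term must be identical. The Python sketch below is only an illustration of that idea, not the procedure used in this study. It assumes that a term of equation (5.6) depends only on the multiset of per-group (PE, MM) counts and on $N_c$, and it assumes a hypothetical set of four (48x48) distributions in which the empty processor group and the empty memory group share the same index. The function names are also assumptions.

```python
def stage1_switch_count(pe, mm):
    """Number of distinct stage 1 switches needed for a distribution."""
    needed = set()
    for x, i_x in enumerate(pe):
        for y, j_y in enumerate(mm):
            if i_x > 0 and j_y > 0:      # both groups hold working elements
                needed.update({(x, y), (y, x)})
    return len(needed)

def cache_key(pe, mm):
    """Key shared by distributions whose (5.6) term is necessarily equal."""
    pairs = tuple(sorted(zip(pe, mm)))   # the product of P16 terms ignores order
    return (pairs, stage1_switch_count(pe, mm))

# Hypothetical (48x48) distributions: one empty group, same index on both sides.
distributions = [
    ((16, 16, 16, 0), (16, 16, 16, 0)),
    ((16, 16, 0, 16), (16, 16, 0, 16)),
    ((16, 0, 16, 16), (16, 0, 16, 16)),
    ((0, 16, 16, 16), (0, 16, 16, 16)),
]

counts = [stage1_switch_count(pe, mm) for pe, mm in distributions]
keys = {cache_key(pe, mm) for pe, mm in distributions}
print(counts, len(keys))
```

Because all four distributions map to one key, the probability term needs to be evaluated once and multiplied by the number of distributions sharing that key.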

Figure 5.6 shows the reliability variation for a 64x64 multiprocessor with I=J=48
and I=J=32. The I=48 result is plotted using only equation (5.6). The analytical
results match closely with simulation without including equation (5.10), since with i
and j equal to 48, only one group of PEs and MMs can be 0 at a time. Hence, the
values from equation (5.10) are negligible. The results for I=32 are plotted by combining
equations (5.6) and (5.10). We have observed that using only equation (5.6), the results
differ from the simulation by less than 10%. So, if the reliability requirements are not
stringent, equation (5.6) should be sufficient to give a close lower bound on reliability.

5.3  Generalization  to  Higher  Systems 

It is possible to extend the analysis of the 64x64 system to a 256x256 multiprocessor.
The basic nature of equations (5.6) and (5.10) remains the same except that each
processor or memory group now has 64 elements. A unique path 256 node configuration
has 4 stages of SEs: stages 0, 1, 2, and 3. Each stage has 64 SEs. The decomposition
of the 256x256 system into four 64x64 groups is done by associating the stage 0 (input)
SEs with the processors, and the stage 2 and 3 SEs with the memory side. Hence, a group of
64 PEs has 16 SEs associated with it. A group of 64 MMs has 16 SEs of stage 2 and
16 SEs of stage 3 associated with it. These 64 PEs and 64 MMs have the identical
connection of a 64x64 system.

The four groups of 64 PEs and four groups of 64 MMs are connected through 64
stage 1 switches. These 64 stage 1 SEs can be divided into 16 groups, each having 4
switches. We can then represent the stage 1 connection of the 256 node system by the
same switching node table of Figure 5.5. Now, each 2-tuple notation $S_{xy}$ represents a
group of 4 SEs. For example, (00) will stand for the 4 SEs that connect 64 PEs of group 0
to 64 MMs of group 0. Similarly, the 8 SEs of (01) and (10) are needed for communication
between $PG_0$ and $MG_1$. The required number of SEs $N_c$ for any system size (ixj)
can be found by counting the number of stage 1 groups and multiplying this number
by four. $N_{fm}$ can also be computed similarly using equation (5.7). Hence, equations
(5.6) and (5.10) can be used by changing each $P_{16(i_x,j_y)}(t)$ notation to $P_{64(i_x,j_y)}(t)$.
The (64x64) system results are used to compute (256x256) system reliability.

It is theoretically possible to use equations (5.6) and (5.10) for a 256 node
reliability computation, but the computation time is prohibitive. This will be illustrated
by an example. Let us assume that we simply want to compute the probability of a
(192x192) distribution. One possible processor grouping is (64, 64, 64, 0). The
memory combinations for this processor grouping vary from (64, 64, 64, 0) to (0, 64, 64,
64). The generation and computation of this large number of memory distributions
for each processor distribution make this model unattractive for higher order systems
such as the 256 node system. One can avoid recomputation of similar combinations,
as discussed in section 5.2, to save computation time. Using these simplification
techniques, we have computed the reliability of a 256x256 multiprocessor requiring at least
192 PEs and 192 MMs. The result is plotted in Figure 5.7. We have also written a
simulation program for the 256 node system to verify these analytical results. The results
are compared in Figure 5.7.

The disadvantage of equations (5.6) and (5.10) for higher order systems lies mainly
in the generation of all distributions and in finding the numbers $N_c$ and $N_{fm}$.
There is no approximation involved in the model except in neglecting the
terms $N_{fm}+1$ to $N_f$. As mentioned in section 5.2, the contribution of these terms is very
small. We are currently looking at approximation techniques that can be used to
compute 256 node system reliability efficiently.

One such approach is to use a recursive computation of higher order systems
starting from the 4x4 model. The first and last stage SEs are always included with
the processors and memories. Starting from the 4x4 model, we can compute the
reliabilities of (8x8), (16x16), (32x32), (64x64), and (128x128) systems without considering the
middle stage SEs. The (256x256) system reliability can then be computed by considering
two (128x128) systems. An approximate number of middle stage SEs will be included
in these expressions to provide connections between the PEs and MMs. For example,
a (64x64) system working with a (48x48) configuration needs at least 9, 12, or 16 SEs
from stage 1 depending on the processor memory distribution. We should then be able
to get a fairly accurate result for a (64x64) system by including an average value for the
reliability of stage 1 SEs with two (32x32) system reliabilities. The same principle can
be applied to a 256 node multiprocessor.

HYPERCUBE DEPENDABILITY MODELING


We have developed an approximate technique to compute the reliability of
hypercube multiprocessors. The model is based on the decomposition principle, where a
hypercube of a higher dimension is recursively decomposed into smaller hypercubes,
until the reliability of the smallest cube is modeled exactly. The reliability of the large
n-cube is then obtained from this smallest base model using a recursive equation. The
reliability model used is task based: it is assumed that the system is operational if
the task can be executed on the system. Analytical results are given for n-dimensional
hypercubes with up to 75% system degradation. The model is validated by comparing
analytical results with simulation results.

6.1.  Modeling  Technique 

We use a 2-cube (4 nodes) or 3-cube (8 nodes) system as the base model in this
analysis. The exact task based reliability analysis of the base model is first done for
various numbers of required nodes, I, where $I \le 2^n$ for n = 2 or 3. The reliability
of a higher dimension cube is obtained recursively from the base model results. We
decompose an n-cube into 2 (n-1)-cubes, each (n-1)-cube in turn into 2 (n-2)-cubes,
etc., until a 4-cube is divided into 2 base model 3-cubes. We start with the exact
base model equations and derive results for a higher dimension system by considering
the connectivity between two (n-1)-cube systems. One possible decomposition of the
problem for a 5-cube system with 20 nodes working is given in Figure 6.1.

In  this  report,  we  shall  assume  that  the  failure  rate  of  the  links  is  negligible  com¬ 
pared  to  the  node  failure  rate.  Thus,  only  processor  failure  is  considered.  While  this 
is  an  optimistic  assumption,  it  is  widely  used  in  the  modeling  of  parallel  architectures 
to  keep  the  analysis  simple.  Further,  if  we  include  the  failure  rate  of  the  common  I/O 
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bus  along  with  the  processor  failure  rate,  the  failure  probability  of  individual  channels 
becomes  very  small.  In  this  case,  the  link  failure  assumption  becomes  less  critical. 

We assume homogeneity of processing nodes, with identical and exponential
distribution of failure time. We define $\lambda_n$ as the failure rate of a node. We consider
reliability evaluation of only non-repairable hypercube systems in this report. A separate
front-end host processor is assumed to perform all the maintenance actions. Detection and
isolation of the failed nodes, and reconfiguration of the system to a degraded mode,
are all done by the host processor. Host processor failure probability is not considered
in this report. However, this can be included in the model without much difficulty.
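The two quantities these assumptions lead to can be sketched in a few lines of Python: the node reliability under an exponential failure law, and the binomial probability of exactly i good units out of N independent, identical units (the term the notation calls G(N, i, p)). This is only an illustrative sketch. The function names and the mission time are assumptions, and the failure rate is the value quoted later in Section 6.4.

```python
import math


def node_reliability(failure_rate, t):
    """R_n(t) = exp(-lambda_n * t) for exponentially distributed failure times."""
    return math.exp(-failure_rate * t)


def G(N, i, p):
    """Probability of exactly i good units out of N, each with reliability p."""
    return math.comb(N, i) * p**i * (1.0 - p) ** (N - i)


# Illustrative values: failure rate per hour and a hypothetical mission time in hours.
p = node_reliability(0.00025, 1000.0)
print("node reliability:", round(p, 4))
print("G(4, i, p) for i = 0..4:", [round(G(4, i, p), 4) for i in range(5)])
```

Because the G(N, i, p) terms over i = 0..N form a complete binomial distribution, they sum to one for any p, which is a convenient check when these terms are combined later.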

We use the following notation in this analysis:


Notation:

$N$ : Number of nodes in the hypercube, $N = 2^n$.

$X_n$ : Random variable that represents the number of good connected processors (nodes) in the n-cube.

$G(N,i,p)$ : $\binom{N}{i} p^i (1-p)^{N-i}$, the probability of having exactly i good units out of N
units, where p is the unit reliability.

$\lambda_n$ : Node failure rate.

$R_n(t)$ : Node reliability at time t, given by $e^{-\lambda_n t} = p$.

$P(X_n = i)$ : The probability of having i good connected units in the n-cube.

$R_s(t)$ : The n-cube transient reliability.

$C_{n-1}(i,j)$ : Connectivity of two (n-1)-cubes; one cube with i connected processors and
the other cube with j connected processors.

$P(C_{n-1} \ne i \mid X_{n-1} = i)$ : The conditional probability of having i disconnected
processors working in the (n-1)-cube.

$D_{n-1}(i,j)$ : Connectivity of two (n-1)-cubes. One group with i connected nodes and
the second group with j disconnected nodes.

$G_c(N,x)$ : Number of x connected nodes from N.

$G_d(N,x)$ : Number of x disconnected nodes from N.

$C_c(x,y)$ : Number of y connected nodes from x connected nodes.

6.2  The  Base  Model

In  this  section  exact  analyses  for  2-cube  and  3-cube  configurations  are  presented. 
We assume perfect coverage in all of these analyses. However, an appropriate value for
coverage  could  be  included  in  the  model  without  changing  its  basic  structure. 

6.2.1  2-Cube  analysis 

A  2-dimensional  hypercube  (with  N  =  4)  is  shown  in  Figure  6.2. 

From simple combinatorics, we have the exact probability for various numbers of
connected working nodes at time t:

$$P(X_n = 4) = R_n(t)^4 \tag{6.1.a}$$

$$P(X_n = 3) = 4R_n(t)^3(1 - R_n(t)) \tag{6.1.b}$$

$$P(X_n = 2) = 4R_n(t)^2(1 - R_n(t))^2 \tag{6.1.c}$$

$$P(X_n = 1) = 4R_n(t)(1 - R_n(t))^3 + 2R_n(t)^2(1 - R_n(t))^2 \tag{6.1.d}$$

$$P(X_n = 0) = (1 - R_n(t))^4 \tag{6.1.e}$$


It should be observed from $P(X_n = 1)$ that a situation such as nodes {0,3}
working, or {1,2} working, in Figure 6.2 gives effectively only one working node, as
the diagonal elements are not connected.
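The coefficients in equations (6.1.a) to (6.1.e) can be checked by brute force. The sketch below enumerates every set of working nodes and records the size of the largest connected group. It assumes binary node labels, so that two nodes are linked when their labels differ in exactly one bit, and it interprets "number of connected working nodes" as the size of the largest connected group; both are assumptions made for this illustration, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def largest_component(n, working):
    """Size of the largest connected group among the working nodes of an n-cube."""
    working = set(working)
    seen, best = set(), 0
    for start in working:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            node = stack.pop()
            size += 1
            for bit in range(n):
                neighbour = node ^ (1 << bit)  # neighbours differ in one address bit
                if neighbour in working and neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        best = max(best, size)
    return best


def coefficient_table(n):
    """Map (connected size i, working count w) to the number of node subsets."""
    table = Counter()
    for w in range(2**n + 1):
        for subset in combinations(range(2**n), w):
            table[(largest_component(n, subset), w)] += 1
    return table


table = coefficient_table(2)
for (i, w), count in sorted(table.items(), reverse=True):
    print(f"P(X_n = {i}) gets {count} * R^{w} * (1-R)^{4 - w}")
```

For the 2-cube the two subsets counted under (i = 1, w = 2) are the diagonal pairs {0,3} and {1,2} discussed above. The same function can be called with n = 3, although the enumeration grows as 2^(2^n) and is only practical for small cubes.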

6.2.2  3-Cube  analysis 

A 3-dimensional hypercube (with N = 8) is shown in Figure 6.3. While the
probability expressions below could be obtained using the 2-cube model, we derive them
directly due to their simplicity.

The probabilities of exactly $l$ connected processors working in the 3-cube, for
$0 \le l \le 8$, are given below.

$$P(X_n = 8) = G(4,4,p)\,G(4,4,p) = R_n(t)^8 \tag{6.2.a}$$

$$P(X_n = 7) = 2G(4,4,p)\,G(4,3,p) = 8R_n(t)^7(1 - R_n(t)) \tag{6.2.b}$$

$$P(X_n = 6) = 2G(4,2,p)\,G(4,4,p) + G(4,3,p)\,G(4,3,p) = 28R_n(t)^6(1 - R_n(t))^2 \tag{6.2.c}$$

$$\begin{aligned}
P(X_n = 5) &= 2G(4,1,p)\,G(4,4,p) + 2G(4,2,p)\,G(4,3,p) - P(X_n = 5)_{\text{disc}} \\
&= 56R_n(t)^5(1 - R_n(t))^3 - 2 \cdot 4R_n(t)^5(1 - R_n(t))^3 \\
&= 48R_n(t)^5(1 - R_n(t))^3
\end{aligned} \tag{6.2.d}$$

The disconnected probability $P(X_n = 5)_{\text{disc}}$ appears in the above expression to
take care of the situations where 4 out of 5 working nodes are connected. An example
is when nodes {0,2,3,5,6} are working. There will be 8 such cases in the 3-cube.

$$\begin{aligned}
P(X_n = 4) &= P(X_n = 5)_{\text{disc}} + 2G(4,0,p)\,G(4,4,p) + 2G(4,1,p)\,G(4,3,p) + G(4,2,p)\,G(4,2,p) - P(X_n = 4)_{\text{disc}} \\
&= 8R_n(t)^5(1 - R_n(t))^3 + 2R_n(t)^4(1 - R_n(t))^4 + 32R_n(t)^4(1 - R_n(t))^4 \\
&\quad + 36R_n(t)^4(1 - R_n(t))^4 - 32R_n(t)^4(1 - R_n(t))^4 \\
&= 8R_n(t)^5(1 - R_n(t))^3 + 38R_n(t)^4(1 - R_n(t))^4
\end{aligned} \tag{6.2.e}$$

The $P(X_n = 5)_{\text{disc}}$ term is the same as the $P(X_n = 5)_{\text{disc}}$ term of the previous equation
(6.2.d). With 4 nodes working, there can be 1, 2, or 3 connected nodes, which should
be subtracted from $P(X_n = 4)$. For example, working nodes {1,2,4,7} are all
disconnected and contribute only to $P(X_n = 1)$. Nodes {0,2,5,7} give 2 connected
groups and add to $P(X_n = 2)$. Finally, nodes {0,2,3,5} contribute only to $P(X_n = 3)$.
All of these disconnected terms are included in the term $32R_n(t)^4(1 - R_n(t))^4$.

$$\begin{aligned}
P(X_n = 3) &= P(X_n = 4)_{\text{disc}} + 2G(4,0,p)\,G(4,3,p) + 2G(4,1,p)\,G(4,2,p) - P(X_n = 3)_{\text{disc}} \\
&= 24R_n(t)^4(1 - R_n(t))^4 + 8R_n(t)^3(1 - R_n(t))^5 \\
&\quad + 48R_n(t)^3(1 - R_n(t))^5 - 32R_n(t)^3(1 - R_n(t))^5
\end{aligned}$$

The first term represents the 24 cases where, out of 4 working nodes, only 3 are
connected. The last term represents the 32 cases where $C_{n-1}(1,2) \ne 3$. The expression
simplifies to

$$P(X_n = 3) = 24R_n(t)^4(1 - R_n(t))^4 + 24R_n(t)^3(1 - R_n(t))^5 \tag{6.2.f}$$

$$\begin{aligned}
P(X_n = 2) &= P(X_n = 4)_{\text{disc}} + P(X_n = 3)_{\text{disc}} + 2G(4,0,p)\,G(4,2,p) + G(4,1,p)\,G(4,1,p) - P(X_n = 2)_{\text{disc}} \\
&= 6R_n(t)^4(1 - R_n(t))^4 + 24R_n(t)^3(1 - R_n(t))^5 \\
&\quad + 12R_n(t)^2(1 - R_n(t))^6 + 16R_n(t)^2(1 - R_n(t))^6 - 16R_n(t)^2(1 - R_n(t))^6
\end{aligned}$$

The  first  term  represents  the  cases  where  2  out  of  4  working  nodes  are  connected. 
The  second  term  represents  the  24  cases  where  2  out  of  3  working  nodes  are  connected. 
The  last  term  is  subtracted  to  take  care  of  the  situations  where  2  working  nodes  are 
disconnected.  After  simplification  we  get 

$$P(X_n = 2) = 6R_n(t)^4(1 - R_n(t))^4 + 24R_n(t)^3(1 - R_n(t))^5 + 12R_n(t)^2(1 - R_n(t))^6 \tag{6.2.g}$$

$$\begin{aligned}
P(X_n = 1) &= P(X_n = 4)_{\text{disc}} + P(X_n = 3)_{\text{disc}} + P(X_n = 2)_{\text{disc}} + 2G(4,0,p)\,G(4,1,p) \\
&= 2R_n(t)^4(1 - R_n(t))^4 + 8R_n(t)^3(1 - R_n(t))^5 \\
&\quad + 16R_n(t)^2(1 - R_n(t))^6 + 8R_n(t)(1 - R_n(t))^7
\end{aligned} \tag{6.2.h}$$

The first term represents the two cases where two diagonal elements from each
2-cube are working, but are disconnected. The second term is for the 8 cases where all
3 working nodes are isolated. The third term is for the cases where 2 working nodes
are disconnected.

$$P(X_n = 0) = (1 - R_n(t))^8 \tag{6.2.i}$$
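As a consistency check on the 3-cube expressions, the sketch below stores the final coefficients of equations (6.2.a) to (6.2.i) in a table and verifies two properties: for every number w of working nodes the coefficients add up to the binomial count C(8, w), and the nine probabilities sum to one. The table layout and names are assumptions made for this illustration.

```python
import math

# COEFF_3CUBE[i][w] = c means the term c * R^w * (1-R)^(8-w) contributes to P(X_n = i),
# as read from the simplified forms of equations (6.2.a)-(6.2.i).
COEFF_3CUBE = {
    8: {8: 1},
    7: {7: 8},
    6: {6: 28},
    5: {5: 48},
    4: {5: 8, 4: 38},
    3: {4: 24, 3: 24},
    2: {4: 6, 3: 24, 2: 12},
    1: {4: 2, 3: 8, 2: 16, 1: 8},
    0: {0: 1},
}


def p_connected_3cube(i, r):
    """P(X_n = i) for the 3-cube at node reliability r."""
    return sum(c * r**w * (1.0 - r) ** (8 - w) for w, c in COEFF_3CUBE[i].items())


def column_sum(w):
    """Total number of subsets with w working nodes accounted for by the table."""
    return sum(terms.get(w, 0) for terms in COEFF_3CUBE.values())


print([column_sum(w) for w in range(9)])
print(sum(p_connected_3cube(i, 0.9) for i in range(9)))
```

Every one of the C(8, w) subsets with w working nodes must be assigned to exactly one value of i, so a mismatch in a column sum would point to a miscounted disconnected term.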

6.3  Generalized  Model 

In this section we develop a generalized reliability model for an n-cube, for n > 3.
We assume that the system works as long as at least $l$ connected processors are working in the
hypercube. The system reliability is expressed as

$$R_s(t) = \sum_{j=l}^{N} P_j(t) \tag{6.3}$$

where $P_j(t)$ is the probability that j connected processors are working in the system
at time t. At any time t, $P_j(t)$ is given by $P(X_n = j)$. Hence, dropping the time
parameter, system reliability can be written as

$$R_s(X_n \ge l) = \sum_{j=l}^{N} P(X_n = j) \tag{6.4}$$
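Equation (6.4) is a tail sum over the distribution of connected working nodes. A minimal sketch, using the exact 2-cube distribution of equations (6.1.a) to (6.1.e) as input, is given below. The function names and the mission times are hypothetical, and the failure rate is the one quoted in Section 6.4.

```python
import math


def two_cube_distribution(r):
    """P(X_n = i) for the 2-cube, equations (6.1.a)-(6.1.e); r is the node reliability."""
    q = 1.0 - r
    return {
        4: r**4,
        3: 4 * r**3 * q,
        2: 4 * r**2 * q**2,
        1: 4 * r * q**3 + 2 * r**2 * q**2,
        0: q**4,
    }


def system_reliability(dist, l):
    """Equation (6.4): probability that at least l connected nodes are working."""
    return sum(prob for j, prob in dist.items() if j >= l)


failure_rate = 0.00025  # per hour
for t in (0.0, 1000.0, 4000.0):  # hypothetical mission times in hours
    dist = two_cube_distribution(math.exp(-failure_rate * t))
    print(t, [round(system_reliability(dist, l), 4) for l in (1, 2, 3, 4)])
```

The same `system_reliability` function applies unchanged to a distribution for a larger cube, once the probabilities P(X_n = j) have been obtained from the recursive model.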


6.3.1  System  Decomposition 


To calculate the probability $P(X_n = j)$ we divide the n-cube into two (n-1)-cubes
(groups). There are two situations under which there will be j connected processors
in the n-cube. In the first situation, exactly j connected nodes are working in the
hypercube, with k nodes in one group and (j-k) nodes in the second group. In the
second situation there are more than j nodes working in the hypercube, but the actual
connectivity is only j. For either of these two cases, there are two possibilities for the
nature of the connectivity between the two (n-1)-cubes. Either the k and (j-k) nodes
working in the two (n-1)-cubes are all connected in their individual groups and the
total connectivity is j, or one of the two groups is not internally connected but the
total connectivity is still j. For the second case, where a total of more than j nodes
are working in the two groups, the two working groups may or may not be connected.

These four possible working node distributions are discussed below.


Distribution  I 


This is the case where there are two connected groups with a total of exactly j
connected nodes for a given j > 1. Let us divide these nodes into two groups such
that k connected nodes are working in one of the (n-1)-cubes, and the remaining (j-k)
connected nodes are working in the second group. These two groups must be connected
such that the total connectivity is j. This situation can be represented as

$$P(X_{n-1} = k)\,P(X_{n-1} = j-k)\,\bigl(1 - P(C_{n-1}(k, j-k) \ne j)\bigr) \tag{6.5}$$

The first two terms give the probabilities that k and (j-k) nodes are connected in their
respective groups. $P(C_{n-1}(k, j-k) \ne j)$ is the probability that the total connectivity
in the n-cube, which is given by the connectivity of k and (j-k) nodes in the two
(n-1)-cubes, is not j. Since we are interested in exactly j connections,
$(1 - P(C_{n-1}(k, j-k) \ne j))$ is used in equation (6.5). If $j > 2^{n-1}$, then
$P(C_{n-1}(k, j-k) \ne j) = 0$.


Distribution  II

The other possibility is that k connected nodes are working in one (n-1)-cube,
and (j-k) nodes are working in a disconnected fashion in the second (n-1)-cube, but
all of the j nodes happen to be connected. For example, assume that node 3 has failed
in group-1 and nodes 4 and 7 have failed in group-2 in Figure 6.3. Hence, the two
working nodes 5 and 6 are disconnected in group-2. But all five working nodes are
connected due to the hypercube topology. This situation can be expressed as

$$P(X_{n-1} = k)\,P(C_{n-1} \ne j-k \mid X_{n-1} = j-k)\,\bigl(1 - P(D_{n-1}(k, j-k) \ne j)\bigr) \tag{6.6}$$

The second term in equation (6.6) gives the conditional probability that (j-k)
working nodes are disconnected. The term $P(D_{n-1}(k, j-k) \ne j)$ gives the probability
that the total connectivity of the k connected nodes from one (n-1)-cube and (j-k)
disconnected nodes from the second (n-1)-cube is not j. The probability of connectivity
being exactly j is obtained by subtracting this value from 1.

We assume that k > (j-k), i.e., the group with fewer nodes is the disconnected
group, since the probability of disconnection decreases with increasing number of nodes
in a group.

Distribution  III 

The third possibility is that j connected nodes are working in one (n-1)-cube,
s processors are working in the second (n-1)-cube, but the two groups are totally
disconnected. For example, assume that nodes {1,3} have failed in group-1 and nodes
{4,6} have failed in group-2 in Figure 6.3. If j = 2, group-1, with 2 connected nodes
{0,2}, satisfies the task requirement. There are three possibilities in group-2: only 5
works, only 7 works, or both 5 and 7 work. None of these three possibilities can
increase the total number of connected nodes in the system, as the two groups are
always disconnected. Obviously, this kind of situation occurs only if $j < 2^{n-1}$. If
$j = 2^{n-1}$, the two groups are always connected. Also, distribution III cannot occur
if $j > 2^{n-1}$. From the 3-cube in Figure 6.3, we find that there are $\min(j, 2^{n-1} - j)$
positions for s in group-2 that are always disconnected from the j positions in group-1.
For example, if j = 3 in group-1, there is only one position in group-2 that is disconnected
from group-1. For j = 2, there are only 2 nodes in group-2 that are disconnected from
group-1. The probability of distribution III can be approximated as

$$2 \sum_{s=1}^{\min(j,\,2^{n-1}-j)} P(X_{n-1} = j) \binom{2^{n-1}-j}{s} R_n(t)^s (1 - R_n(t))^{2^{n-1}-s}, \quad \text{for } j < 2^{n-1} \tag{6.7}$$

This equation gives a nearly exact probability when j is close to but less than $2^{n-1}$.
As the difference between j and $2^{n-1}$ increases, equation (6.7) becomes less accurate,
since we are not choosing s from all possible $(2^{n-1} - j)$ positions. The factor 2 appears
in equation (6.7) since the j connected nodes can be in either of the two (n-1)-groups.

Distribution  IV 

This last case depicts a situation where some k nodes, k > j, are working in the
n-cube, but only j of them are connected, with the two groups not totally disconnected.
This is the reverse case of distribution III, where the two groups are disconnected. For
example, suppose that nodes {0,3,6} have failed in the 3-cube of Figure 6.3. This
leaves 5 working processors in the system, of which only 4 are connected; node 2 is
disconnected. We represent this case as

$$\sum_{k=j+1}^{N} P(C_n = j \mid X_n = k) \tag{6.8}$$

where N is the total number of nodes in the hypercube. Equation (6.8) represents the
probability that j nodes are connected from k working nodes.

Now, by combining all the four cases, the approximate equation for j connected
nodes is given by

$$\begin{aligned}
P(X_n = j) ={} & \sum_{k=m}^{M} P(X_{n-1} = k)\,P(X_{n-1} = j-k)\,\bigl(1 - P(C_{n-1}(k, j-k) \ne j)\bigr) \\
& + \sum_{k=m}^{M} P(X_{n-1} = k)\,P(C_{n-1} \ne j-k \mid X_{n-1} = j-k)\,\bigl(1 - P(D_{n-1}(k, j-k) \ne j)\bigr) \\
& + 2 \sum_{s=1}^{\min(j,\,2^{n-1}-j)} P(X_{n-1} = j) \binom{2^{n-1}-j}{s} R_n(t)^s (1 - R_n(t))^{2^{n-1}-s} \\
& + \sum_{k=j+1}^{N} P(C_n = j \mid X_n = k)
\end{aligned} \tag{6.9}$$

where m and M are given by $m = \max(0, j - 2^{n-1})$ and $M = \min(2^{n-1}, j)$. These
two values determine the lower and upper bounds of k in an (n-1)-cube.
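The bounds m and M simply enumerate the ways in which j working nodes can be split between the two (n-1)-cubes. A small sketch (the function names are hypothetical):

```python
def split_bounds(n, j):
    """Lower and upper bounds (m, M) of k in equation (6.9)."""
    half = 2 ** (n - 1)  # number of nodes in each (n-1)-cube
    return max(0, j - half), min(half, j)


def splits(n, j):
    """All (k, j - k) pairs of working nodes in the two (n-1)-cubes."""
    m, M = split_bounds(n, j)
    return [(k, j - k) for k in range(m, M + 1)]


print(split_bounds(3, 5))
print(splits(3, 5))
```

For n = 3 and j = 5 the splits are (1,4), (2,3), (3,2) and (4,1), which correspond to the products 2G(4,1,p)G(4,4,p) and 2G(4,2,p)G(4,3,p) in equation (6.2.d).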

The second term in equation (6.9) is used if k > j-k. Otherwise, this expression
is evaluated as

$$\sum_{k=m}^{M} P(X_{n-1} = j-k)\,P(C_{n-1} \ne k \mid X_{n-1} = k)\,\bigl(1 - P(D_{n-1}(j-k, k) \ne j)\bigr)$$

Also, as explained under distribution III, the third term is evaluated if $j < 2^{n-1}$. It
should be observed that equation (6.9) is a recursive expression; the n-cube probability
is derived from the (n-1)-cube probability. The recursion is continued until (n-1) = 2 or 3.

6.3.2  Term  Evaluation 

There are four different probability terms in equation (6.9) that need to be
quantified. These are $P(C_{n-1}(k, j-k) \ne j)$, $P(C_{n-1} \ne j-k \mid X_{n-1} = j-k)$,
$P(D_{n-1}(k, j-k) \ne j)$, and $P(C_n = j \mid X_n = k)$. We address these terms below.

I) $P(C_{n-1}(k, j-k) \ne j)$

If $j > 2^{n-1}$, the probability of disconnection between two connected groups of k
and (j-k) nodes from the two (n-1)-cubes is zero. For example, let j = 5, k = 3, and n = 3.
Because k and (j-k) nodes are connected in the two groups, there must be at least one
link that connects the two groups.

The probability of disconnection is non-zero when $j < 2^{n-1}$. If we choose j-k
connected processors from one (n-1)-cube such that $(j-k) < 2^{n-2}$, then the remaining
$(2^{n-1} - j + k)$ nodes are also connected, considering no failure. On the other hand, if
$(j-k) > 2^{n-2}$, the $(2^{n-1} - j + k)$ nodes are not always connected. In other words,
when more than half of the nodes are connected in a hypercube, the rest of the nodes
may be dispersed on the various vertices of the cube without being connected. Hence,
if we assume that (j-k) is a small number, k > j-k, then $(2^{n-1} - j + k)$ nodes will
be connected.

Now, if we assume that (j-k) nodes are connected in one (n-1)-cube, then there
are $(2^{n-1} - j + k)$ counterpart positions in the second (n-1)-cube which have no direct
connection to the j-k nodes of the first group. For example, if {5,7} are the j-k nodes
in group-2 of Figure 6.3, then nodes {0,2} are not connected to 5 and 7. We will refer
to nodes 0 and 2 as the counterparts of nodes 5 and 7. Nodes {1,3} have connection
to 5 and 7 directly.
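The counterpart positions can be computed directly from the node labels. The sketch below assumes binary labels in which the two (n-1)-cubes are separated by the top address bit (nodes 0 to 3 and 4 to 7 for the 3-cube), which is consistent with the examples quoted from Figure 6.3; the function name is hypothetical.

```python
def counterpart_positions(n, chosen):
    """Nodes in the other (n-1)-cube with no direct link to any node in `chosen`.

    Assumes all nodes in `chosen` lie in the same (n-1)-cube and that a node's
    only link into the other (n-1)-cube flips the top address bit.
    """
    top = 1 << (n - 1)
    chosen = set(chosen)
    sides = {node & top for node in chosen}
    if len(sides) != 1:
        raise ValueError("chosen nodes must lie in a single (n-1)-cube")
    other_side = top ^ sides.pop()
    other_group = {node for node in range(2**n) if node & top == other_side}
    directly_linked = {node ^ top for node in chosen}
    return other_group - directly_linked


print(sorted(counterpart_positions(3, {5, 7})))  # counterparts of nodes 5 and 7
print(sorted(counterpart_positions(3, {0, 2})))  # positions in group-2 not linked to {0, 2}
```

With j-k chosen nodes there are always 2^(n-1) - (j-k) counterpart positions, because each chosen node has exactly one direct link into the other (n-1)-cube.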

Since we are looking for k and (j-k) nodes to be disconnected, we can guarantee this
situation if the k connected nodes are now chosen from the $(2^{n-1} - j + k)$ counterpart
positions in the second group. Hence, we can write

$$P(C_{n-1}(k, j-k) \ne j) =
\begin{cases}
0 & \text{if } j > 2^{n-1} \\[4pt]
\dfrac{(\text{no. of } (j-k) \text{ processors connected}) \times (\text{no. of } k \text{ processors connected from } (2^{n-1}-j+k) \text{ connected})}{(\text{no. of } (j-k) \text{ processors connected}) \times (\text{no. of } k \text{ processors connected})} & \text{if } j < 2^{n-1}
\end{cases}$$

$$=
\begin{cases}
0 & \text{if } j > 2^{n-1} \\[4pt]
\dfrac{C_c(2^{n-1} - j + k,\, k)}{G_c(2^{n-1},\, k)} & \text{if } j < 2^{n-1}
\end{cases} \tag{6.10}$$

where $G_c(2^{n-1}, k)$ gives the number of k connected processors from one (n-1)-cube
and $C_c(2^{n-1} - j + k, k)$ gives the number of k connected processors from among
$(2^{n-1} - j + k)$ connected processors. We determine the disconnected probability by
dividing the number of disconnected combinations by the total number of possible
connections from among two (n-1)-cubes.

The exact evaluation of the term $C_c(2^{n-1} - j + k, k)$ is extremely difficult. However,
after examining several cases, we approximate this term as

$$C_c(2^{n-1} - j + k,\, k) \approx 2^{n-1} - j + 1 \quad \text{if } j < 2^{n-1} \tag{6.11}$$

The validity of this approximation will be discussed when we analyze the
analytical results in section 5.

Evaluation of the denominator in equation (6.10) is also complex, since a simple
$\binom{2^{n-1}}{k}$ does not guarantee connected nodes. We know $P(X_{n-1} = j)$ both for the base
model and for higher order models. Hence, we approximate

$$P(X_{n-1} = j) \approx G_c(2^{n-1}, j)\,R_n(t)^j (1 - R_n(t))^{2^{n-1}-j} \tag{6.12}$$

This is an approximation, since we could also get j connected working nodes from
more than j working nodes, as discussed in distributions III and IV.

Finally, if the k > j-k condition is not satisfied, we can change the order of
evaluation as (j-k) > k in order to satisfy all values of k from m to M in equation
(6.9).


II) $P(C_{n-1} \ne j-k \mid X_{n-1} = j-k)$

Here we are interested in determining the probability that there are (j-k)
processors working in the (n-1)-cube, but not all of the (j-k) nodes are connected. This
probability can be expressed as

$$P(C_{n-1} \ne j-k \mid X_{n-1} = j-k) = \binom{2^{n-1}}{j-k} R_n(t)^{j-k} (1 - R_n(t))^{2^{n-1}-j+k} - P(X_{n-1} = j-k) \tag{6.13}$$

The first term in equation (6.13) represents the probability of all possible
combinations of (j-k) nodes from among $2^{n-1}$ nodes. The second term gives the probability
that all the (j-k) nodes are connected. Subtracting the second term from the first
term, we get the probability that the connectivity is not equal to j-k.
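As an illustration of equation (6.13), take the 2-cube base model as the (n-1)-cube, so that $2^{n-1} = 4$, and let $j - k = 2$. Substituting only the exactly-two-working term $4R_n(t)^2(1 - R_n(t))^2$ of the 2-cube for the connected probability (an assumption made for this worked example) gives:

```latex
\begin{aligned}
P(C_{n-1} \ne 2 \mid X_{n-1} = 2)
  &= \binom{4}{2} R_n(t)^2 (1 - R_n(t))^2 - 4 R_n(t)^2 (1 - R_n(t))^2 \\
  &= 6 R_n(t)^2 (1 - R_n(t))^2 - 4 R_n(t)^2 (1 - R_n(t))^2 \\
  &= 2 R_n(t)^2 (1 - R_n(t))^2 .
\end{aligned}
```

The remaining coefficient 2 counts the two diagonal pairs {0,3} and {1,2}, in agreement with the second term of equation (6.1.d).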

III) $P(D_{n-1}(k, j-k) \ne j)$

Here, we are interested in finding the probability that k connected nodes from
one (n-1)-cube and (j-k) disconnected nodes from the second cube are not j connected.
If $k = 2^{n-1}$, this probability is 0, since all the disconnected (j-k) nodes of one group
are connected to k nodes of the other group. Also, if j-k = 0 or 1, then there is no
disconnection possibility. It is only when $k < 2^{n-1}$ and (j-k) > 1 that this probability
is non-zero.

Since k nodes are working in one (n-1)-cube, there are $(2^{n-1} - k)$ counterpart
positions in the second (n-1)-cube that have no connection with the k nodes of the
first group. If we choose s nodes from these positions, then the remaining (j-k-s) nodes
must be disconnected from s to satisfy the condition that (j-k) nodes be disconnected.
Hence, we write

$$P(D_{n-1}(k, j-k) \ne j) = \frac{\displaystyle\sum_{s=1}^{\min(2^{n-1}-k,\; j-k-1)} (s \text{ nodes in } (2^{n-1}-k) \text{ positions}) \times (j-k-s \text{ nodes from available positions}) \times (\text{probability that } s \text{ and } (j-k-s) \text{ are disconnected})}{\text{no. of } (j-k) \text{ nodes disconnected}}$$

$$= \frac{\displaystyle\sum_{s=1}^{\min(2^{n-1}-k,\; j-k-1)} \binom{2^{n-1}-k}{s} \binom{total}{j-k-s}\, I_{s,\,j-k-s}}{G_d(2^{n-1},\, j-k)} \tag{6.14}$$

where $total = \min(2^{n-1} - s \cdot n,\, k)$ is the number of available positions from which
to choose the (j-k-s) positions. We choose the minimum of the two terms, because if
$k < (2^{n-1} - s \cdot n)$, there are some positions in $2^{n-1} - s \cdot n$ which are disjoint from the k
positions in the first group. This argument is based on the fact that k positions in one
(n-1)-cube are connected to exactly k positions in the second (n-1)-cube. Hence, if we
choose (j-k-s) from $2^{n-1} - s \cdot n$, where $(2^{n-1} - s \cdot n) > k$, there is a possibility that we
can have three disjoint groups: k, s, (j-k-s). Since we consider the completely disjoint
case in distribution III, these terms should not be included in distribution II. The term
$G_d(2^{n-1}, j-k)$ gives the total number of cases where (j-k) nodes are disconnected.
This can be obtained from equation (6.13) by the following approximation.

$$P(C_{n-1} \ne j-k \mid X_{n-1} = j-k) \approx G_d(2^{n-1}, j-k)\,R_n(t)^{j-k} (1 - R_n(t))^{2^{n-1}-j+k} \tag{6.15}$$

The parameter s can vary from 1 to (j-k-1) so that s and (j-k-s) can be considered as
two disconnected groups.

The term $I_{s,\,j-k-s}$ represents the disconnected probability of s and (j-k-s) nodes.
If we choose s nodes from $(2^{n-1} - k)$ counterpart positions, all the neighboring nodes
of s cannot be selected for disconnected positions. For example, if we choose node 0
as s in Figure 6.3, then nodes 1 and 2, which are connected to 0, cannot be used as disconnected
positions. Only node 3 can be selected for (j-k-s). Hence, the number of available
disconnected positions must be greater than or equal to (j-k-s) for a disconnection between
s and (j-k-s) to exist. On the other hand, when the number of available disconnected
positions is less than (j-k-s), some of the connected positions of s will be used for (j-k)
positions. In this case, there is no disconnection. We write this function as

$$I_{s,\,j-k-s} =
\begin{cases}
0 & \text{if } 2^{n-1} - s \cdot n < j-k-s \\
1 & \text{if } 2^{n-1} - s \cdot n \ge j-k-s
\end{cases} \tag{6.16}$$

It should be observed that for a given s, $s \cdot n$ positions are not available for
disconnection, where n is the size of the cube. This leaves only $(2^{n-1} - s \cdot n)$ positions in
the (n-1)-cube for choosing the remaining (j-k-s) positions. The condition
$2^{n-1} - s \cdot n \ge (j-k-s)$ assures that s and j-k-s are disconnected.
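The indicator of equation (6.16) is a simple threshold test. The sketch below is one way to write it, under the reading that each of the s chosen nodes removes n positions (itself and its n-1 neighbours in the (n-1)-cube) from the pool of disconnected positions; the function and argument names are hypothetical.

```python
def disconnection_indicator(n, s, remaining):
    """I_{s, j-k-s} of equation (6.16); `remaining` stands for j - k - s."""
    available = 2 ** (n - 1) - s * n  # positions left after excluding s nodes and neighbours
    return 1 if available >= remaining else 0


# Example from the text: n = 3, node 0 chosen as s (s = 1) leaves only node 3.
print(disconnection_indicator(3, 1, 1))
print(disconnection_indicator(3, 1, 2))
```

For n = 3 and s = 1 there is 4 - 3 = 1 available position, so one further disconnected node can be placed but two cannot.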

IV) $P(C_n = j \mid X_n = k)$, $j < k < 2^{n-1}$

Let us assume that there are s connected processors working in group-1 and (k-s)
processors working in group-2. Out of the (k-s), only (j-s) are connected to group-1,
and (k-j) are disconnected from group-1. The upper value of s is given by
$U = \min(2^{n-1} - 1,\, j - 1)$, since if $s = 2^{n-1}$, all the working nodes from group-1 are connected
to all the (k-s) working nodes of group-2. If s = j, or j = 1, then all the working nodes are
in one group. This distribution then becomes identical to distribution III. Hence, the
lesser of $2^{n-1} - 1$ and j-1 is the upper bound for one working group. The lower bound
for s is given by $L = \max(1,\, k - (2^{n-1} - 1))$, since no more than $(2^{n-1} - 1)$ nodes
can be working in the second group if the two groups are disconnected. This leaves us
with $k - (2^{n-1} - 1)$ nodes to start with in the first group. Therefore, the maximum of
1 and $k - (2^{n-1} - 1)$ gives the lower bound for s.

Now, the situation is that s connected nodes are in group-1, (k-s) disconnected
nodes are in group-2 and the connectivity between the two groups is j. This is expressed
as

$$P(C_n = j \mid X_n = k) = 2 \sum_{s=L}^{U} P(X_{n-1} = s)\,P(C_{n-1} \ne k-s \mid X_{n-1} = k-s)\,P(D_{n-1}(s, k-s) = j) \tag{6.17}$$


The first two terms of equation (6.17) are already known. We have to evaluate only
the third term. This can be derived using the same argument as for term III, since
$P(D_{n-1}(s, k-s) \ne k)$ can be written as the summation of all $P(D_{n-1}(s, k-s) = i)$
terms for $s < i < k$. However, we are interested only in j connections between s and
k-j.

Since there are s processors working in the first group, there are $(2^{n-1} - s)$
positions in the second group that are not connected to those s positions. As (k-j) positions
are always disconnected from j positions, we choose (k-j) from $(2^{n-1} - s)$. The
number of ways in which we can choose (k-s) disconnected nodes from $2^{n-1}$ is given by
$G_d(2^{n-1}, k-s)$. We then write the third term in equation (6.17) as


$$P(D_{n-1}(s, k-s) = j) = \frac{\dbinom{2^{n-1}-s}{k-j} \dbinom{total1}{j-s}\, I_{k-j,\,j-s}}{G_d(2^{n-1},\, k-s)} \tag{6.18}$$

where $total1 = \min(2^{n-1} - (k-j) \cdot n,\, s)$ is the number of available positions from which to
choose the remaining (j-s) positions, which are connected to the s positions of the first
group. The term total1 is computed based on the same argument as for the term
total in equation (6.14). $I_{k-j,\,j-s}$ gives the probability that there is no connection
between (k-j) and (j-s) nodes in the second group. If we can guarantee that there is no
connection between (k-j) and (j-s) nodes, then we have j connected nodes from k. This
is because the (j-s) nodes in the second group are selected such that there is always
some connection to the first group s nodes. More specifically, for each s positions in
one (n-1)-cube, there are s positions in the second (n-1)-cube which are connected.


The term $I_{k-j,\,j-s}$ is given from equation (6.16) as

$$I_{k-j,\,j-s} =
\begin{cases}
0 & \text{if } 2^{n-1} - (k-j) \cdot n < j - s \\
1 & \text{if } 2^{n-1} - (k-j) \cdot n \ge j - s
\end{cases} \tag{6.19}$$

Equation (6.19) shows that when the number of disconnected positions is less
than the required (j-s), the disconnection probability is 0. Otherwise (j-s) and (k-j) are
always disconnected.


All the terms in equation (6.9) are now quantified to derive $P(X_n = j)$. The
system reliability can be computed from equation (6.4).

We expect that equation (6.9) will give fairly accurate results when the value
of j is close to N/2, since most of the working nodes are connected when j is large.
Equations (6.12) and (6.15) give better approximation in this case. When j is less
than N/2, the probability of a larger number of disconnected working nodes increases.
Equations (6.12) and (6.15) deviate more from reality in this situation.


6.3.3  Modified  Method 

We can divide equation (6.9) into two parts depending on the actual number of
working nodes. The first two terms of equation (6.9) give the probability of j connected
nodes when exactly j nodes are working in the system. The last two terms give j
connected nodes from k working nodes where k > j. Hence, we can rewrite $P(X_n = j)$ as

$$P(X_n = j)^* = P(X_n = j) + \sum_{k=j+1}^{N} P(C_n = j \mid X_n = k)^* \tag{6.20}$$

We evaluate the first term of equation (6.20) for all the j connected nodes. This
is computed recursively from the first two terms of equation (6.9). After expanding
the first two terms of equation (6.9) and simplifying, we get


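$$\begin{aligned}
P(X_n = j) ={} & \sum_{k=m}^{M} P(X_{n-1} = k)\,G(2^{n-1}, j-k, p) \\
& - \sum_{k=m}^{M} P(X_{n-1} = j-k)\,C_c(2^{n-1} - j + k,\, k)\,R_n(t)^k (1 - R_n(t))^{2^{n-1}-k} \\
& - \sum_{k=m}^{M} P(X_{n-1} = k)\,P(D_{n-1}(k, j-k) \ne j)\,G_d(2^{n-1}, j-k)\,R_n(t)^{j-k} (1 - R_n(t))^{2^{n-1}-j+k}
\end{aligned} \tag{6.21}$$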


The second term of equation (6.20) is computed assuming that all the disconnected
nodes are present only one level lower than the n-cube, i.e., the computation is done
at the (n-1)-cube level. It does not go recursively down to the base model. We write
this as

$$P(C_n = j \mid X_n = k)^* = P(C_n = j \mid X_n = k) + 2P(X_{n-1} = j) \binom{2^{n-1}-j}{k-j} R_n(t)^{k-j} (1 - R_n(t))^{2^{n-1}-k+j} \tag{6.22}$$

The first term of equation (6.22) is the same as equation (6.17). However, this is
required only at the (n-1)-cube level. The second term is for distribution III, given by
equation (6.7). This term is valid for $j < 2^{n-1}$ and $k < 2^{n-1}$.

Since all the terms required for equation (6.22) at the (n-1)-cube level are available
from the previously computed terms of equation (6.21), this modified method is faster
than equation (6.9).

6.4  Results  and  Discussion 

The original equation (6.9) or the modified equation (6.20) can be used to compute
$P(X_n = j)$. Most of these results are based on a node failure rate of 250 in $10^6$ hours,
i.e., $\lambda = 0.00025$. While analytical results for hypercubes of any size can be computed
using this model, we include only a few results here because of space limits. Figure
6.4 shows the reliability variation with time of a 6-dimensional hypercube for different
task requirements. The solid curves are for the analytical results using the modified
technique. The dotted curves are for simulation results. The analytical results are
computed using the 2-cube base model. However, the results are almost the same for both
the 2-cube and 3-cube. It can be observed from Figure 6.4 that the analytical and
simulation results match closely for l = 48, 32, and 16.

Figure 6.5 shows the reliability variation of a 7-cube system under different task
requirements. The analytical and simulation curves match closely for 25%, 50%, and
75% node degradation (l = 96, 64, 32). The difference between the analytical and
simulation results is less than 6%. In Figure 6.6, the reliability variation of an 8-cube
system for l = 64, 128, and 192 is given. The results match very closely with those of
the simulation.
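The report does not describe how its simulation was implemented. As a hedged illustration of how such a check could be set up, the Monte Carlo sketch below samples independent node failures and estimates the probability that at least l working nodes form one connected group. The binary node labelling, the largest-group interpretation of connectivity, the trial count, and all names are assumptions.

```python
import random


def largest_component(n, working):
    """Size of the largest connected group of working nodes in an n-cube."""
    working = set(working)
    seen, best = set(), 0
    for start in working:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            node = stack.pop()
            size += 1
            for bit in range(n):
                neighbour = node ^ (1 << bit)  # flip one address bit
                if neighbour in working and neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        best = max(best, size)
    return best


def simulate_reliability(n, l, r, trials=20000, seed=1):
    """Estimate R_s(X_n >= l) when each node works independently with probability r."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(trials):
        working = [node for node in range(2**n) if rng.random() < r]
        if largest_component(n, working) >= l:
            successes += 1
    return successes / trials


# Compare against the exact 2-cube value of equation (6.1.a) for l = 4.
estimate = simulate_reliability(n=2, l=4, r=0.9)
print(estimate, 0.9**4)
```

To trace a reliability curve over time, r would be replaced by exp(-λt) for each time point of interest.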


CHAPTER  7 


CONCLUSIONS 

This report is intended to summarize our research efforts in evaluating parallel
architectures for BM/C applications. The efforts are focused on defining evaluation
criteria, developing tools, and performing analysis to determine the suitability of
existing parallel computer architectures to provide optimal processing environments
for the BM/C applications. Our research work consists of three major components:
(1) development of a BBN Butterfly performance predictor, (2) mapping of the
Battle Management Algorithm to the Butterfly Parallel Computer, and (3) performance
and dependability evaluators for the Butterfly parallel computer and the Hypercube
multiprocessor.

The main component of the Butterfly Performance Predictor is a program that
simulates program execution on a Butterfly while accumulating performance measures.
Performance estimates are based on instruction execution rate information derived
from Motorola data books for the basic hardware components of the
Butterfly computer. To drive the simulator under conditions representative of the
target application domain, a second program generates synthetic instruction streams.
The first version of the Butterfly simulator is under development in the C
programming language on a SUN 3/50 workstation running 4.2BSD UNIX.
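
The two-program structure described above (a stream generator driving a simulator that accumulates measures) can be sketched as follows. This is a minimal illustration, not the predictor itself: the instruction classes, the cycle counts, and the clock rate are hypothetical placeholders and are not taken from any data book, and the sketch is written in Python rather than C for brevity.

```python
import random

# Hypothetical cycles per instruction class (placeholders, not data-book values).
CYCLES = {"alu": 4, "load": 8, "store": 8, "branch": 10, "remote_ref": 40}
CLOCK_HZ = 8_000_000  # assumed clock rate, for illustration only

def synthetic_stream(mix, length, seed=0):
    """Generate a synthetic instruction stream from a mix {class: weight}."""
    rng = random.Random(seed)
    classes = list(mix)
    weights = [mix[c] for c in classes]
    return rng.choices(classes, weights=weights, k=length)

def simulate(stream, cycles=CYCLES, clock_hz=CLOCK_HZ):
    """Walk the stream, accumulating per-class counts and total cycles."""
    counts = {}
    total_cycles = 0
    for op in stream:
        counts[op] = counts.get(op, 0) + 1
        total_cycles += cycles[op]
    return {
        "instructions": len(stream),
        "cycles": total_cycles,
        "seconds": total_cycles / clock_hz,
        "counts": counts,
    }
```

A run such as `simulate(synthetic_stream({"alu": 0.6, "load": 0.3, "remote_ref": 0.1}, 10000))` yields an estimated execution time for that instruction mix; changing the mix models a different application domain.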

The Battle Management Algorithm, formulated as a linear programming algorithm,
is a problem that by nature requires intensive interprocess communication for
simultaneous process execution. Our approach is to minimize the contention costs both
in shared memories and in communication links by setting up a tree-shaped
communication structure among processor nodes. The tree-shaped communication structure is
used for searching for a minimum value in one computational phase and for broadcasting
it in another computational phase. The proposed method is also applicable to other
algorithms with similar data-flow graphs. Because it minimizes contention costs,
the algorithm-based method offers a new technique for mapping
algorithms onto current parallel processors. The parallel algorithm is evaluated
by a simplified mathematical analysis that approximately predicts
system performance. The result indicates that if the proposed method is used to map
a linear programming algorithm of size n onto the Butterfly™ parallel processor,
an  .  speedup  can  be  achieved. 

0\niog^r\)  1  r 
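
The two uses of the tree-shaped communication structure, minimum search in one phase and broadcast in another, can be sketched as below. The sketch assumes a binary tree, which is a choice made here for illustration; the fan-out of the actual structure is not restated in this chapter. It counts communication rounds to show why both phases take a logarithmic number of steps in the number of nodes.

```python
def tree_min(values):
    """Pairwise (binary-tree) reduction; returns (minimum, number_of_rounds)."""
    level = list(values)
    if not level:
        raise ValueError("need at least one value")
    rounds = 0
    while len(level) > 1:
        # Each parent keeps the smaller of its (up to) two children.
        level = [min(level[i:i + 2]) for i in range(0, len(level), 2)]
        rounds += 1
    return level[0], rounds

def tree_broadcast(value, n):
    """Send `value` from the root to n nodes; returns (copies, number_of_rounds)."""
    if n < 1:
        raise ValueError("need at least one node")
    have = 1  # only the root holds the value at first
    rounds = 0
    while have < n:
        # Every node that holds the value forwards it to one more node.
        have = min(n, have * 2)
        rounds += 1
    return [value] * n, rounds
```

With eight nodes, each phase completes in three rounds, and no node exchanges data with more than a fixed number of neighbors per round, which is what limits contention.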

Reliability evaluation of multiprocessor systems using a Butterfly-type network is
also addressed. The Butterfly network is a multistage network built from 4x4 switches.
The novelty of the evaluation technique lies in the development of an analytical model
that permits precise definition and accurate analysis of reliability.
The model is based on a decomposition technique. Using this technique,
the reliability of a 64-processor, 64-memory configuration (64x64) is computed
from the reliabilities of four (16x16) systems. The (16x16) reliability in turn is
computed from the reliabilities of four (4x4) systems. The reliability model is known
as task-based reliability, in which a
system remains operational as long as a task can be executed on the system. Failures
of the PEs, MMs, and SEs are included in the analysis to consider a complete system.
The model is suitable for the analysis of medium-size systems, such as (16x16) and
(64x64) multiprocessors. While a (256x256) configuration could be analyzed using our
model, the computation time is a major concern unless some approximation technique
is used to simplify the switch connection. We are currently investigating
approximate methods to improve computational efficiency.
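
The decomposition hierarchy described above (64x64 from four 16x16, each from four 4x4) can be enumerated with a short sketch. It assumes a plain multistage network of 4x4 switches with no redundant paths, so the switch count is simply the number of stages times the number of switches per stage; the function names are illustrative and the reliability computation itself is not reproduced here.

```python
def decomposition_levels(size, base=4):
    """Subsystem sizes from `size` down to the base switch size, dividing by `base`."""
    levels = [size]
    while levels[-1] > base:
        if levels[-1] % base:
            raise ValueError("size must be a power of the base switch size")
        levels.append(levels[-1] // base)
    if levels[-1] != base:
        raise ValueError("size must be a power of the base switch size")
    return levels

def switch_count(size, base=4):
    """Total base x base switches: one stage per level, size/base switches per stage."""
    stages = len(decomposition_levels(size, base))
    return stages * (size // base)
```

Under these assumptions a (64x64) system has three levels and 48 switches, while a (256x256) system has four levels and 256 switches, which gives a feel for how quickly the state space grows with system size.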

We have also devised a new analytical technique to compute the reliability of
an n-dimensional hypercube. The model is based on the decomposition principle. A
recursive equation is derived to compute n-cube reliability from 2-cube or 3-cube
base models. The model is developed by considering four different situations in which
the required number of connected nodes is working in the system. Analytical results
for various hypercubes are compared with simulation results to show that they are in
close agreement. Several extensions of this model are presently under investigation.
The immediate extension is to apply this model to repairable hypercubes to compute
transient and steady-state availability. In addition, including link failures in the
base model and between two base models should allow us to consider both node and link
failures.
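
A simulation of the kind used for comparison can be sketched as a Monte Carlo experiment. The sketch below is not the simulator used in this work. It assumes exponential node lifetimes, perfect links, and that the task requirement is met when the largest connected group of working nodes reaches the required size; the actual task model may impose a stricter structure on the working nodes.

```python
import math
import random

def largest_working_component(n, working):
    """Size of the largest connected set of working nodes in an n-cube."""
    seen = set()
    best = 0
    for start in working:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        size = 0
        while stack:
            node = stack.pop()
            size += 1
            for bit in range(n):
                neighbor = node ^ (1 << bit)  # neighbors differ in exactly one bit
                if neighbor in working and neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        best = max(best, size)
    return best

def simulate_reliability(n, required, t_hours, rate=0.00025, trials=2000, seed=1):
    """Fraction of trials in which `required` connected nodes survive to time t."""
    rng = random.Random(seed)
    p_up = math.exp(-rate * t_hours)  # exponential lifetime assumed
    successes = 0
    for _ in range(trials):
        working = {v for v in range(1 << n) if rng.random() < p_up}
        if largest_working_component(n, working) >= required:
            successes += 1
    return successes / trials
```

For instance, `simulate_reliability(6, 48, 1000)` estimates the reliability of a 6-cube that needs 48 connected nodes after 1000 hours under these assumptions.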

Performance and dependability evaluations are essential for any system
characterization. Two types of parallel systems, namely the Butterfly and the
Hypercube, are studied here. No existing tool is available for the performance
evaluation of these machines that considers both the architecture and the application
algorithms. Similarly, none of the existing dependability packages can be applied to
these systems directly. These observations clearly indicate the need to develop
evaluation tools for these architectures. Preliminary results from the Butterfly
performance predictor and from the Butterfly and Hypercube reliability tools are
included in this report.